The rise of ghost and fraudulent job postings has become a major problem for job platforms, with recent industry reports estimating that up to 20–30% of online job ads show signs of suspicious or deceptive activity. These posts create harmful experiences for users: wasting applicants’ time, exposing them to phishing attempts, and eroding trust in the platform. They also hurt employers, who depend on credible marketplaces to attract qualified candidates. For job-posting websites like LinkedIn or Indeed, the challenge is scale: millions of posts go live every month, making manual review impossible. Predictive analytics offers a way to proactively identify high-risk postings by learning patterns that distinguish legitimate jobs from fraudulent or “ghost” listings (those that are never actually reviewed or filled).

### Data Issues:
We will be creating logistic regression, decision tree, SVM, random forest, KNN, and ANN models. We will also create a stacked model that combines these individual models, using a decision tree as the second-level model placed on top of them. We want a decision tree at the second level because it combines the base models without simply aggregating them: when we aggregate (average) models, we get a model that is roughly the average of its parts, which will not be better than the best individual model. With the decision tree, the goal is to take the best of each model, producing a stacked model that outperforms the individual models.
By building and comparing models such as logistic regression, decision tree, SVM, random forest, KNN, ANN, and stacked models, we can evaluate which approach most effectively reduces false negatives, because missing a fraudulent post is far more damaging than mistakenly flagging a real one. Ultimately, the goal of this project is to design a model that improves platform safety, protects job seekers, and strengthens the integrity of online hiring ecosystems.
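The stacking idea can be sketched in miniature. This is a toy illustration with synthetic predicted probabilities standing in for the base models (the report's actual base models are built later), using `rpart` as the second-level decision tree:

```r
library(rpart)  # decision tree used as the second-level (meta) model

set.seed(42)
n <- 500
truth <- rbinom(n, 1, 0.05)  # ~5% positive class, similar to the fraudulent rate

# Stand-ins for the first-level models' predicted probabilities
pred_logit <- plogis(2 * truth - 1 + rnorm(n))
pred_rf    <- plogis(3 * truth - 1.5 + rnorm(n))
pred_knn   <- plogis(1 * truth - 0.5 + rnorm(n))

# The meta-model sees only the base models' outputs, not the raw features,
# and learns region by region which base model to trust
stack_df   <- data.frame(truth = factor(truth), pred_logit, pred_rf, pred_knn)
meta       <- rpart(truth ~ ., data = stack_df, method = "class")
stack_pred <- predict(meta, stack_df, type = "class")
```

In the real pipeline, the predictor columns of `stack_df` would be each base model's out-of-fold predictions on the training data, so the meta-model is not fit on leaked information.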
We need to load the data into an object (job) so that we
can interact with it. We leave stringsAsFactors as FALSE,
since some string columns should not be converted to factors
until we deal with them while cleaning the data.
# Let's store the data in the job object so we can interact with it
job <- read.csv("fake_job_postings.csv", stringsAsFactors = FALSE)
We need to get a sense of the data before we can start cleaning it. It
is important to remove any columns that are unnecessary or that we
should not use in our prediction models (e.g., data that would not be
available at prediction time). It is also a good time to deal with NA
data (if there is any). In addition, variables with too many factor
levels (>40) do not behave well in certain models, so consolidating
them here will be important. Finally, for the KNN and ANN models, it is
important to dummify and scale the data. So, we will use
job for the logistic regression, SVM, and random forest
models, job_dummy for the decision tree model (the
non-dummified version will not produce a model based on our data), and
job_scaled for the KNN and ANN models. After cleaning the
data, it is good to double-check that the desired results
have been achieved.
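A minimal sketch of how the dummified and scaled copies can be produced (the column names here are toy illustrations; the actual feature set is built in the cleaning steps below):

```r
toy <- data.frame(
  employment_type = factor(c("Full-time", "Contract", "Full-time", "Part-time")),
  title_len  = c(16, 41, 39, 33),
  fraudulent = c(0, 0, 1, 0)
)

# model.matrix() expands each factor into 0/1 dummy columns
toy_dummy <- as.data.frame(model.matrix(~ . - 1, data = toy))

# Min-max scaling so KNN distances and ANN inputs share a common [0, 1] range
normalize  <- function(x) (x - min(x)) / (max(x) - min(x))
toy_scaled <- as.data.frame(lapply(toy_dummy, normalize))
```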
str(job) # Let's get a sense of columns and data types
## 'data.frame': 17880 obs. of 18 variables:
## $ job_id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Marketing Intern" "Customer Service - Cloud Video Production" "Commissioning Machinery Assistant (CMA)" "Account Executive - Washington DC" ...
## $ location : chr "US, NY, New York" "NZ, , Auckland" "US, IA, Wever" "US, DC, Washington" ...
## $ department : chr "Marketing" "Success" "" "Sales" ...
## $ salary_range : chr "" "" "" "" ...
## $ company_profile : chr "We're Food52, and we've created a groundbreaking and award-winning cooking site. We support, connect, and celeb"| __truncated__ "90 Seconds, the worlds Cloud Video Production Service.90 Seconds is the worlds Cloud Video Production Service e"| __truncated__ "Valor Services provides Workforce Solutions that meet the needs of companies across the Private Sector, with a "| __truncated__ "Our passion for improving quality of life through geography is at the heart of everything we do. Esri’s geogra"| __truncated__ ...
## $ description : chr "Food52, a fast-growing, James Beard Award-winning online food community and crowd-sourced and curated recipe hu"| __truncated__ "Organised - Focused - Vibrant - Awesome!Do you have a passion for customer service? Slick typing skills? Maybe "| __truncated__ "Our client, located in Houston, is actively seeking an experienced Commissioning Machinery Assistant that posse"| __truncated__ "THE COMPANY: ESRI – Environmental Systems Research InstituteOur passion for improving quality of life through g"| __truncated__ ...
## $ requirements : chr "Experience with content management systems a major plus (any blogging counts!)Familiar with the Food52 editoria"| __truncated__ "What we expect from you:Your key responsibility will be to communicate with the client, 90 Seconds team and fre"| __truncated__ "Implement pre-commissioning and commissioning procedures for rotary equipment.Execute all activities with subco"| __truncated__ "EDUCATION: Bachelor’s or Master’s in GIS, business administration, or a related field, or equivalent work exper"| __truncated__ ...
## $ benefits : chr "" "What you will get from usThrough being part of the 90 Seconds team you will gain:experience working on projects"| __truncated__ "" "Our culture is anything but corporate—we have a collaborative, creative environment; phone directories organize"| __truncated__ ...
## $ telecommuting : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_logo : int 1 1 1 1 1 0 1 1 1 1 ...
## $ has_questions : int 0 0 0 0 1 0 1 1 1 0 ...
## $ employment_type : chr "Other" "Full-time" "" "Full-time" ...
## $ required_experience: chr "Internship" "Not Applicable" "" "Mid-Senior level" ...
## $ required_education : chr "" "" "" "Bachelor's Degree" ...
## $ industry : chr "" "Marketing and Advertising" "" "Computer Software" ...
## $ function. : chr "Marketing" "Customer Service" "" "Sales" ...
## $ fraudulent : int 0 0 0 0 0 0 0 0 0 0 ...
summary(job)
## job_id title location department
## Min. : 1 Length:17880 Length:17880 Length:17880
## 1st Qu.: 4471 Class :character Class :character Class :character
## Median : 8940 Mode :character Mode :character Mode :character
## Mean : 8940
## 3rd Qu.:13410
## Max. :17880
## salary_range company_profile description requirements
## Length:17880 Length:17880 Length:17880 Length:17880
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## benefits telecommuting has_company_logo has_questions
## Length:17880 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :character 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.0429 Mean :0.7953 Mean :0.4917
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## employment_type required_experience required_education industry
## Length:17880 Length:17880 Length:17880 Length:17880
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## function. fraudulent
## Length:17880 Min. :0.00000
## Class :character 1st Qu.:0.00000
## Mode :character Median :0.00000
## Mean :0.04843
## 3rd Qu.:0.00000
## Max. :1.00000
We want to remove columns that are irrelevant or that we would not have access to when using this model to predict the fraudulent outcome. We also want to be careful about how we modify the data: converting data into factors or applying other simplifications can cause information to be lost, which will hurt the robustness of our models.
job$job_id <- NULL # This is not necessary, so let's delete it
# All of these should be treated like factors instead of strings
job$location <- as.factor(job$location)
job$department <- as.factor(job$department)
job$salary_range <- as.factor(job$salary_range)
job$employment_type <- as.factor(job$employment_type)
job$required_experience <- as.factor(job$required_experience)
job$required_education <- as.factor(job$required_education)
job$industry <- as.factor(job$industry)
job$function. <- as.factor(job$function.)
Here we parse the words in the benefits column and transform them into binary variables.
# Create binary flags for top 5 signals in benefits
job$benefits_pipe <- grepl("\\|", job$benefits) # pipe symbol
job$benefits_hash <- grepl("#", job$benefits) # hash symbol
job$benefits_bonus <- grepl("bonus", job$benefits, ignore.case = TRUE) # keyword: bonus
job$benefits_apply <- grepl("apply|contact", job$benefits, ignore.case = TRUE) # keywords: apply/contact
job$benefits_benefits <- grepl("benefits", job$benefits, ignore.case = TRUE) # keyword: benefits
# Convert to numeric 0/1 if needed
job[, c("benefits_pipe", "benefits_hash", "benefits_bonus",
"benefits_apply", "benefits_benefits")] <-
lapply(job[, c("benefits_pipe", "benefits_hash", "benefits_bonus",
"benefits_apply", "benefits_benefits")], as.numeric)
job$benefits <- nchar(job$benefits) # Number of characters (length of benefit section) might be beneficial to our prediction model
job$benefits <- ifelse(is.na(job$benefits), mean(job$benefits, na.rm = T), job$benefits) # deal with NAs by replacing with mean value
Here we parse the words in the title column and transform them into binary variables.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
job <- job %>%
mutate(
slash_present = if_else(str_detect(title, "/"), 1, 0), # slash
backslash_present = if_else(str_detect(title, "\\\\"), 1, 0), # backslash
amp_present = if_else(str_detect(title, "&"), 1, 0), # ampersand
exclam_present = if_else(str_detect(title, "!"), 1, 0), # exclamation
dash_present = if_else(str_detect(title, "-"), 1, 0), # dash/hyphen
multiple_spaces = if_else(str_detect(title, " {2,}"), 1, 0),# double spaces
parens_present = if_else(str_detect(title, "\\(|\\)"), 1, 0), # parentheses
numbers_present = if_else(str_detect(title, "[0-9]"), 1, 0) # any digits
)
job$title <- nchar(job$title) # Number of characters (length of title section) might be beneficial to our prediction model
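One detail worth noting above is the quadruple backslash: the pattern "\\\\" used for backslash_present is, after R's string parsing, the two-character regex \\, which matches one literal backslash. A quick check with base grepl on hypothetical titles (not rows from the data):

```r
# "\\\\" in source -> regex \\ -> matches one literal backslash
grepl("\\\\", c("AC/DC Technician", "C:\\path\\style\\title"))  # FALSE TRUE
```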
Here we parse the words in the requirements column and transform them into binary variables.
# Make everything lowercase once for speed
job <- job %>%
mutate(req_clean = tolower(requirements))
# 1. Requirements missing or very short (< 10 words)
job$req_missing_or_short <- as.integer(
is.na(job$req_clean) | str_count(job$req_clean, "\\w+") < 10
)
# 2. Heavy engineering / industrial terms
eng_terms <- c(
"asme", "api", "ansi", "pressure vessel", "heat exchanger",
"pumps", "compressor", "valve", "kilovolt", "kv",
"scada", "plc", "p&id", "process hazard", "piping"
)
job$has_heavy_engineering_terms <- as.integer(
str_detect(job$req_clean, str_c(eng_terms, collapse = "|"))
)
# 3. Certifications / accreditations
# Note: short tokens like "pe" also match inside longer words
# (e.g., "experience"), so this flag fires broadly
cert_terms <- c(
  "pmp", "pe", "certified", "license", "licence",
  "six sigma", "cfa", "osha", "hazwoper"
)
job$has_certification_terms <- as.integer(
str_detect(job$req_clean, str_c(cert_terms, collapse = "|"))
)
# 4. Years of experience (1+, 2+, “years”, “5-10”, etc.)
job$has_years_experience <- as.integer(
str_detect(job$req_clean, "\\d+\\+?\\s*years?")
)
# 5. Degree required
job$has_degree_required <- as.integer(
str_detect(job$req_clean, "bachelor|degree|required degree|bs |ms |mba")
)
# 6. Software / tools
tool_terms <- c(
"ms office", "excel", "word", "powerpoint",
"sap", "quickbooks", "primavera", "autocad"
)
job$has_tool_software_terms <- as.integer(
str_detect(job$req_clean, str_c(tool_terms, collapse = "|"))
)
# 7. Safety / regulatory language
safety_terms <- c(
"osha", "compliance", "safety procedures", "audit",
"regulations", "hazmat", "hazard"
)
job$has_safety_regulation_terms <- as.integer(
str_detect(job$req_clean, str_c(safety_terms, collapse = "|"))
)
# 8. Heavy bullet lists / long enumerations
job$req_contains_heavy_lists <- as.integer(
str_count(job$req_clean, "- |•|\\*|\\n") > 10
)
# 9. Title–requirement mismatch flag
# (Engineering terms inside non-engineering jobs)
job$req_title_mismatch <- with(job, as.integer(
str_detect(req_clean, str_c(eng_terms, collapse = "|")) &
!str_detect(tolower(title), "engineer|technician|operator|mechanic")
))
job$req_clean <- NULL # No longer need this
job$requirements <- nchar(job$requirements) # Number of characters (length of requirement section) might be beneficial to our prediction model
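A quick sanity check of the years-of-experience and tool regexes above, run on hypothetical requirement strings (using base grepl so the snippet is self-contained):

```r
samples <- c("5+ years of experience with SAP and AutoCAD",
             "No prior experience needed")

grepl("\\d+\\+?\\s*years?", samples)    # TRUE FALSE
grepl("sap|autocad", tolower(samples))  # TRUE FALSE
```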
Here we parse the words in the description column and transform them into binary variables.
# Create 10 binary feature columns from job$description
job$has_urgent_language <- as.integer(grepl(
"urgent|apply now|immediate start|asap|start immediately",
job$description, ignore.case = TRUE))
job$has_no_experience_needed <- as.integer(grepl(
"no experience|training provided|any background|anyone can apply",
job$description, ignore.case = TRUE))
job$has_salary_info <- as.integer(grepl(
  "\\$|salary|per hour|per annum|k", # broad salary catch-all; the lone "k" matches any word containing the letter k
job$description, ignore.case = TRUE))
job$has_qualification_terms <- as.integer(grepl(
"bachelor|degree|certificate|qualification|experience required",
job$description, ignore.case = TRUE))
job$has_benefits_stated <- as.integer(grepl(
"benefits|health insurance|401k|superannuation|paid time off|leave",
job$description, ignore.case = TRUE))
job$has_technical_terms <- as.integer(grepl(
"sql|python|excel|jira|crm|compliance|financial analysis",
job$description, ignore.case = TRUE))
job$has_contact_number_or_whatsapp <- as.integer(grepl(
"\\b\\d{3}[- ]?\\d{3}[- ]?\\d{4}\\b|whatsapp",
job$description, ignore.case = TRUE))
job$has_company_language <- as.integer(grepl(
"team|mission|vision|culture|values|our company",
job$description, ignore.case = TRUE))
job$has_commission_only_language <- as.integer(grepl(
"commission only|high earning|earn up to|unlimited income",
job$description, ignore.case = TRUE))
job$description <- nchar(job$description) # Number of characters (length of description section) might be beneficial to our prediction model
Here we parse the words in the company profile column and transform them into binary variables.
library(dplyr)
library(stringr)
job <- job %>%
mutate(
has_referral_bonus = ifelse(str_detect(tolower(company_profile), "referral bonus|bonus for referral"), 1, 0),
has_signing_bonus = ifelse(str_detect(tolower(company_profile), "signing bonus|bonus by"), 1, 0),
has_perks = ifelse(str_detect(tolower(company_profile), "perks|corporate discounts|benefits"), 1, 0),
has_relocation = ifelse(str_detect(tolower(company_profile), "relocation|out of town candidates|move assistance"), 1, 0)
)
# Preview
#job %>% select(company_profile, has_referral_bonus, has_signing_bonus, has_perks, has_relocation) %>% head()
job$company_profile <- nchar(job$company_profile) # Number of characters (length of company profile section) might be beneficial to our prediction model
# With 84% of the values being null, we figured it best to transform the data into binary values of whether salary is known or not. It did not make sense to try to replace NA values with the mean since we only have values for 16% of the data.
# Count blanks
sum(job$salary_range == "", na.rm = TRUE)
## [1] 15012
job$salary_known <- ifelse(is.na(job$salary_range) | job$salary_range == "", 0, 1)
job$salary_known <- factor(job$salary_known)
table(job$salary_known)
##
## 0 1
## 15012 2868
# 1. Convert to character, trim, lowercase
job$department_clean <- tolower(trimws(as.character(job$department)))
# 2. Replace blanks or NAs with "NA"
job$department_clean[job$department_clean == "" | is.na(job$department_clean)] <- "NA"
# 3. Merge obvious duplicates / similar departments
merge_list <- list(
"customer service" = c("customer service", "customer service ", "cs", "csd relay"),
"it" = c("it", "information technology", "it services"),
"marketing" = c("marketing", "performance marketing"),
"sales" = c("sales", "sales and marketing"),
"administration" = c("admin", "administrative", "administration", "administration support", "admin/clerical", "admin - clerical"),
"accounting" = c("accounting", "accounting/finance", "accounting and finance", "accounting & finance", "accounting/payroll"),
"engineering" = c("engineering", "engineering "),
"hr" = c("hr", "human resources"),
"product" = c("product", "product development", "product innovation", "product team"),
"operations" = c("operations", "oil & energy", "oil and gas", "maintenance"),
"customer_facing" = c("client services", "customer success", "customer support", "content", "creative services"),
"business_management" = c("business", "business development", "management", "project management"),
"tech_development" = c("software development", ".net", ".net development", "tech", "technical", "technical support", "design", "development"),
"education_training" = c("didactics", "education", "editorial"),
"operations_logistics" = c("warehouse", "voyageur medical transportation")
)
# Apply merges
for (new_name in names(merge_list)) {
job$department_clean[job$department_clean %in% merge_list[[new_name]]] <- new_name
}
# 4. Group rare departments (≤10 occurrences) into "other"
dept_counts <- table(job$department_clean)
job$department_clean <- ifelse(dept_counts[job$department_clean] > 10,
job$department_clean,
"other")
# 5. Convert to factor
job$department_clean <- factor(job$department_clean)
# Check resulting counts
table(job$department_clean)
##
## account management accounting administration
## 13 48 95
## all art studio business_management
## 16 11 82
## clerical commercial creative
## 27 18 48
## customer service customer_facing department
## 135 119 23
## digital education_training engagement
## 14 58 13
## engineering finance hr
## 512 74 86
## international growth it legal
## 17 355 24
## marketing merchandising NA
## 443 11 11553
## operations operations_logistics other
## 366 28 2252
## permanent product production
## 13 185 33
## qa r&d retail
## 18 55 46
## sales squiz support
## 594 20 19
## tech_development technology
## 377 79
# Convert to character and trim whitespace
job$industry_clean <- trimws(as.character(job$industry))
# Replace blanks or NA with "NA"
job$industry_clean[job$industry_clean == "" | is.na(job$industry_clean)] <- "NA"
# Create industry groupings
industry_map <- list(
"Technology & Software" = c(
"Computer Software", "Information Technology and Services", "Internet",
"Computer Games", "Computer Hardware", "Computer Networking",
"Computer & Network Security", "Semiconductors", "Information Services",
"Program Development", "Nanotechnology"
),
"Healthcare, Wellness & Life Sciences" = c(
"Healthcare, Wellness & Life Sciences", "Hospital & Health Care",
"Medical Practice", "Mental Health Care", "Health, Wellness and Fitness",
"Pharmaceuticals", "Biotechnology", "Medical Devices", "Veterinary"
),
"Finance, Banking & Insurance" = c(
"Financial Services", "Banking", "Insurance", "Investment Management",
"Venture Capital & Private Equity", "Capital Markets", "Investment Banking",
"Accounting"
),
"Business Administration" = c(
"Staffing and Recruiting", "Human Resources", "Executive Office"
),
"Consulting, Professional Services & Legal" = c(
"Management Consulting", "Legal Services", "Law Practice", "Government Relations",
"Alternative Dispute Resolution", "Individual & Family Services"
),
"Consumer Goods, Retail & Fashion" = c(
"Consumer Goods", "Consumer Services", "Retail", "Apparel & Fashion",
"Cosmetics", "Sporting Goods", "Luxury Goods & Jewelry", "Textiles",
"Furniture", "Consumer Electronics", "Wholesale"
),
"Media, Entertainment & Creative" = c(
"Public Relations and Communications", "Media Production", "Broadcast Media",
"Publishing", "Music", "Entertainment", "Animation", "Graphic Design",
"Design", "Photography", "Writing and Editing", "Motion Pictures and Film",
"Market Research", "Online Media", "Performing Arts", "Sports",
"Marketing and Advertising"
),
"Hospitality, Travel & Leisure" = c(
"Hospitality", "Leisure, Travel & Tourism", "Restaurants", "Gambling & Casinos",
"Airlines/Aviation", "Events Services", "Facilities Services"
),
"Education & Training" = c(
"Education Management", "E-Learning", "Primary/Secondary Education",
"Higher Education", "Professional Training & Coaching", "Libraries",
"Museums and Institutions", "Translation and Localization", "Research"
),
"Manufacturing & Industrial" = c(
"Electrical/Electronic Manufacturing", "Mechanical or Industrial Engineering",
"Industrial Automation", "Machinery", "Chemicals", "Plastics",
"Printing", "Packaging and Containers", "Shipbuilding", "Civil Engineering",
"Automotive", "Business Supplies and Equipment"
),
"Energy, Utilities & Environment" = c(
"Oil & Energy", "Renewables & Environment", "Utilities", "Environmental Services",
"Mining & Metals", "Wireless", "Telecommunications"
),
"Transportation, Logistics & Supply Chain" = c(
"Logistics and Supply Chain", "Warehousing", "Transportation/Trucking/Railroad",
"Package/Freight Delivery", "Maritime", "Import and Export",
"International Trade and Development", "Outsourcing/Offshoring"
),
"Agriculture, Food & Natural Resources" = c(
"Food & Beverages", "Food Production", "Farming", "Fishery", "Ranching",
"Wine and Spirits"
),
"Real Estate & Construction" = c(
"Construction", "Real Estate", "Commercial Real Estate", "Building Materials",
"Architecture & Planning"
),
"Government, Nonprofit & Public Sector" = c(
"Government Administration", "Nonprofit Organization Management",
"Civic & Social Organization", "Public Policy", "Public Safety",
"Law Enforcement", "Philanthropy", "Fund-Raising", "Religious Institutions"
),
"Defense, Security & Aerospace" = c(
"Defense & Space", "Military", "Security and Investigations", "Aviation & Aerospace"
)
)
# Apply the mapping
for (group in names(industry_map)) {
job$industry_clean[job$industry_clean %in% industry_map[[group]]] <- group
}
# Optional: group any remaining very rare industries (≤10 occurrences) into "other"
industry_counts <- table(job$industry_clean)
job$industry_clean <- ifelse(industry_counts[job$industry_clean] > 10,
job$industry_clean,
"other")
# Convert to factor
job$industry_clean <- factor(job$industry_clean)
# Check resulting counts
table(job$industry_clean)
##
## Agriculture, Food & Natural Resources
## 146
## Business Administration
## 243
## Consulting, Professional Services & Legal
## 267
## Consumer Goods, Retail & Fashion
## 889
## Defense, Security & Aerospace
## 65
## Education & Training
## 1034
## Energy, Utilities & Environment
## 727
## Finance, Banking & Insurance
## 1189
## Government, Nonprofit & Public Sector
## 199
## Healthcare, Wellness & Life Sciences
## 814
## Hospitality, Travel & Leisure
## 465
## Manufacturing & Industrial
## 336
## Media, Entertainment & Creative
## 1486
## NA
## 4903
## Real Estate & Construction
## 425
## Technology & Software
## 4436
## Transportation, Logistics & Supply Chain
## 256
# Convert to character and trim whitespace
job$function_clean <- trimws(as.character(job$function.)) # assuming your column is 'function.'
# Replace blanks or NA with "NA"
job$function_clean[job$function_clean == "" | is.na(job$function_clean)] <- "NA"
# Create function groupings
function_map <- list(
"Marketing & Advertising" = c("Marketing & Advertising", "Marketing", "Advertising", "Public Relations"),
"Analytics & Business Development" = c("Business Development", "Data Analyst", "Business Analyst"),
"Sales & Customer Service & IT" = c("Customer Service", "Sales", "Information Technology"),
"Management & Leadership" = c("Management", "Strategy/Planning", "General Business", "Administrative"),
"Engineering & Production" = c("Engineering", "Production", "Manufacturing", "Product & Project", "Product Management", "Project Management"),
"Healthcare & Science" = c("Health Care Provider", "Science"),
"Supply Chain & Logistics" = c("Supply Chain", "Purchasing", "Distribution"),
"Finance & Accounting" = c("Finance", "Accounting/Auditing", "Financial Analyst"),
"Human Resources & Training" = c("Human Resources", "Training", "Consulting"),
"Legal & Compliance" = c("Legal", "Quality Assurance"),
"Arts" = c("Art/Creative", "Writing/Editing", "Design")
)
# Apply the mapping
for (group in names(function_map)) {
job$function_clean[job$function_clean %in% function_map[[group]]] <- group
}
# Optional: group any remaining very rare functions into "other"
function_counts <- table(job$function_clean)
job$function_clean <- ifelse(function_counts[job$function_clean] > 10,
job$function_clean,
"other")
# Convert to factor
job$function_clean <- factor(job$function_clean)
# Check resulting counts
table(job$function_clean)
##
## Analytics & Business Development Arts
## 394 604
## Education Engineering & Production
## 325 1835
## Finance & Accounting Healthcare & Science
## 417 352
## Human Resources & Training Legal & Compliance
## 387 158
## Management & Leadership Marketing & Advertising
## 1061 996
## NA Other
## 6455 325
## Research Sales & Customer Service & IT
## 50 4446
## Supply Chain & Logistics
## 75
# Grab the country
job$loc_country <- substr(job$location, 1, 2)
job$loc_country <- as.factor(job$loc_country)
loc_count <- as.data.frame(table(job$loc_country))
sort(table(job$loc_country))
##
## AL CM CO GH HR JM KH KZ MA PE SD SI SV
## 1 1 1 1 1 1 1 1 1 1 1 1 1
## UG AM BD CL IS KW LK SK TN ZM VI NI TT
## 1 2 2 2 2 2 2 2 2 2 3 4 4
## TW VN CZ LV KE RS NO AR BH BY LU PA IQ
## 4 4 6 6 7 7 8 9 9 9 9 9 10
## KR NG TH CY ID MT UA AT HU MU CH CN SA
## 10 10 10 11 13 13 13 14 14 14 15 15 15
## BG TR MX PT JP RU MY QA LT PK FI IT BR
## 17 17 18 18 20 20 21 21 23 27 29 31 36
## ZA DK RO SE EG AE ES FR EE IL PL HK SG
## 40 42 46 49 52 54 66 70 72 72 76 77 80
## IE BE NL PH AU IN NZ DE CA GR GB US
## 114 117 127 132 214 276 333 346 383 457 940 2384 10656
# tabulate counts by country
tab <- table(job$loc_country)
# map counts back to rows
job$loc_country_val <- as.integer(tab[ as.character(job$loc_country) ])
# create new column: keep country if count > 10, else "Other"
job$loc_country_new <- ifelse(job$loc_country_val > 10, as.character(job$loc_country), "Other")
# Replace blank or missing codes with "NA", then convert to factor
job$loc_country_new[is.na(job$loc_country_new) | job$loc_country_new == ""] <- "NA"
job$loc_country_new <- factor(job$loc_country_new)
# We need to remove these columns since we have cleaned them and replaced them with a cleaned version
job$loc_country_val <- NULL
job$loc_country <- NULL
job$department <- NULL
job$salary_range <- NULL
job$industry <- NULL
job$location <- NULL
job$function. <- NULL
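The substr() call works here because, when a location is present, it always begins with a two-letter ISO country code in the form "CC, region, city"; blank locations simply yield "", which the code above maps to "NA". A toy check:

```r
# First two characters give the country code; blanks pass through as ""
substr(c("US, NY, New York", "NZ, , Auckland", ""), 1, 2)  # "US" "NZ" ""
```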
str(job)
## 'data.frame': 17880 obs. of 52 variables:
## $ title : int 16 41 39 33 19 16 21 32 10 39 ...
## $ company_profile : int 885 1286 879 614 1628 0 881 1025 1364 684 ...
## $ description : int 905 2077 355 2600 1520 3418 433 2488 75 1219 ...
## $ requirements : int 852 1433 1363 1429 757 0 764 368 359 769 ...
## $ benefits : num 0 1292 0 782 21 ...
## $ telecommuting : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_logo : int 1 1 1 1 1 0 1 1 1 1 ...
## $ has_questions : int 0 0 0 0 1 0 1 1 1 0 ...
## $ employment_type : Factor w/ 6 levels "","Contract",..: 4 3 1 3 3 1 3 1 3 5 ...
## $ required_experience : Factor w/ 8 levels "","Associate",..: 6 8 1 7 7 1 7 1 2 4 ...
## $ required_education : Factor w/ 14 levels "","Associate Degree",..: 1 1 1 3 3 1 7 1 1 6 ...
## $ fraudulent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_pipe : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_hash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_apply : num 0 1 0 0 0 0 0 0 0 0 ...
## $ benefits_benefits : num 0 0 0 0 1 0 1 0 0 0 ...
## $ slash_present : num 0 0 0 0 0 0 1 0 0 0 ...
## $ backslash_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ amp_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclam_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dash_present : num 0 1 0 1 0 0 0 0 0 1 ...
## $ multiple_spaces : num 0 0 0 0 0 0 0 1 0 0 ...
## $ parens_present : num 0 0 1 0 0 0 1 0 0 0 ...
## $ numbers_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_missing_or_short : int 0 0 0 0 0 1 0 0 0 0 ...
## $ has_heavy_engineering_terms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_certification_terms : int 1 1 1 1 1 0 1 1 1 1 ...
## $ has_years_experience : int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_degree_required : int 1 0 1 1 1 0 1 0 0 1 ...
## $ has_tool_software_terms : int 1 1 0 1 0 0 0 0 0 1 ...
## $ has_safety_regulation_terms : int 0 0 0 0 0 0 0 0 0 0 ...
## $ req_contains_heavy_lists : int 0 0 0 0 0 0 0 0 0 0 ...
## $ req_title_mismatch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_urgent_language : int 0 0 0 0 0 0 0 0 0 1 ...
## $ has_no_experience_needed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_salary_info : int 1 1 1 1 1 1 1 1 0 1 ...
## $ has_qualification_terms : int 0 0 0 0 0 1 0 0 0 0 ...
## $ has_benefits_stated : int 0 0 0 1 0 0 0 0 0 0 ...
## $ has_technical_terms : int 0 1 0 1 1 1 0 1 0 0 ...
## $ has_contact_number_or_whatsapp: int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_language : int 1 1 1 1 1 1 1 1 0 1 ...
## $ has_commission_only_language : int 0 0 0 0 0 0 0 0 0 0 ...
## $ has_referral_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_signing_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_perks : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_relocation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ salary_known : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
## $ department_clean : Factor w/ 38 levels "account management",..: 22 27 24 34 24 24 27 24 24 24 ...
## $ industry_clean : Factor w/ 17 levels "Agriculture, Food & Natural Resources",..: 14 13 14 16 10 14 13 14 16 8 ...
## $ function_clean : Factor w/ 15 levels "Analytics & Business Development",..: 10 14 11 14 6 11 9 11 11 14 ...
## $ loc_country_new : Factor w/ 50 levels "AE","AT","AU",..: 49 35 49 49 49 49 11 49 49 49 ...
summary(job)
## title company_profile description requirements
## Min. : 3.00 Min. : 0.0 Min. : 3 Min. : 0.0
## 1st Qu.: 19.00 1st Qu.: 138.0 1st Qu.: 607 1st Qu.: 146.0
## Median : 25.00 Median : 570.0 Median : 1017 Median : 467.0
## Mean : 28.53 Mean : 620.9 Mean : 1218 Mean : 590.1
## 3rd Qu.: 35.00 3rd Qu.: 879.0 3rd Qu.: 1586 3rd Qu.: 820.0
## Max. :142.00 Max. :6178.0 Max. :14907 Max. :10864.0
##
## benefits telecommuting has_company_logo has_questions
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 45.0 Median :0.0000 Median :1.0000 Median :0.0000
## Mean : 208.9 Mean :0.0429 Mean :0.7953 Mean :0.4917
## 3rd Qu.: 294.0 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :4429.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## employment_type required_experience required_education
## : 3471 :7050 :8105
## Contract : 1524 Mid-Senior level:3809 Bachelor's Degree :5145
## Full-time:11620 Entry level :2697 High School or equivalent:2080
## Other : 227 Associate :2297 Unspecified :1397
## Part-time: 797 Not Applicable :1116 Master's Degree : 416
## Temporary: 241 Director : 389 Associate Degree : 274
## (Other) : 522 (Other) : 463
## fraudulent benefits_pipe benefits_hash benefits_bonus
## Min. :0.00000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.04843 Mean :0.001566 Mean :0.05543 Mean :0.07131
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.00000 Max. :1.00000
##
## benefits_apply benefits_benefits slash_present backslash_present
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.0000000
## Mean :0.04234 Mean :0.2012 Mean :0.09659 Mean :0.0001119
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.0000000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.0000000
##
## amp_present exclam_present dash_present multiple_spaces
## Min. :0.00000 Min. :0.00000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.00000 Median :0.00000 Median :0.000 Median :0.000000
## Mean :0.03356 Mean :0.01102 Mean :0.169 Mean :0.009228
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.00000 Max. :1.000 Max. :1.000000
##
## parens_present numbers_present req_missing_or_short
## Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.08853 Mean :0.04787 Mean :0.1763
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.0000
##
## has_heavy_engineering_terms has_certification_terms has_years_experience
## Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.0000
## Mean :0.06549 Mean :0.7724 Mean :0.3444
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000
##
## has_degree_required has_tool_software_terms has_safety_regulation_terms
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.4238 Mean :0.3497 Mean :0.03216
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## req_contains_heavy_lists req_title_mismatch has_urgent_language
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.02136 Mean :0.06549 Mean :0.07959
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
##
## has_no_experience_needed has_salary_info has_qualification_terms
## Min. :0.000000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.000000 Median :1.0000 Median :0.0000
## Mean :0.007159 Mean :0.9705 Mean :0.1006
## 3rd Qu.:0.000000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.0000 Max. :1.0000
##
## has_benefits_stated has_technical_terms has_contact_number_or_whatsapp
## Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.09077 Mean :0.2698 Mean :0.00179
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## has_company_language has_commission_only_language has_referral_bonus
## Min. :0.0000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :1.0000 Median :0.00000 Median :0.000000
## Mean :0.6383 Mean :0.00453 Mean :0.006432
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.00000 Max. :1.000000
##
## has_signing_bonus has_perks has_relocation salary_known
## Min. :0.000000 Min. :0.00000 Min. :0.000000 0:15012
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1: 2868
## Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.003132 Mean :0.05872 Mean :0.009955
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000
##
## department_clean industry_clean
## NA :11553 NA :4903
## other : 2252 Technology & Software :4436
## sales : 594 Media, Entertainment & Creative :1486
## engineering : 512 Finance, Banking & Insurance :1189
## marketing : 443 Education & Training :1034
## tech_development: 377 Consumer Goods, Retail & Fashion: 889
## (Other) : 2149 (Other) :3943
## function_clean loc_country_new
## NA :6455 US :10656
## Sales & Customer Service & IT:4446 GB : 2384
## Engineering & Production :1835 GR : 940
## Management & Leadership :1061 CA : 457
## Marketing & Advertising : 996 DE : 383
## Arts : 604 NA : 346
## (Other) :2483 (Other): 2714
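The `summary(job)` output above shows that `fraudulent` has a mean of 0.04843, i.e. only about 4.8% of postings are labeled fraudulent (roughly a 20:1 imbalance). Since our goal is minimizing false negatives, it is worth tabulating this explicitly before modeling; a minimal sketch:

```r
# Class balance of the target variable (counts and proportions)
table(job$fraudulent)
round(prop.table(table(job$fraudulent)), 4)  # ~0.9516 legitimate vs ~0.0484 fraudulent
```

With an imbalance this severe, raw accuracy is misleading (a model that always predicts "legitimate" is already ~95% accurate), so model comparison should emphasize recall/sensitivity on the fraudulent class.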
# job (with factors intact) works for the logistic regression, SVM, and random forest models
# For the decision tree model, we dummy-code the factor variables
job_dummy <- as.data.frame(model.matrix(~ . - 1, data = job))

# Min-max normalization to the [0, 1] range
minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# The KNN and ANN models need dummy-coded *and* scaled data
job_scaled <- as.data.frame(lapply(job_dummy, minmax))
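One caveat with `minmax`: if any dummy column is constant (all 0s or all 1s), `max(x) - min(x)` is 0 and the division yields `NaN`, which would break the KNN and ANN models downstream. A quick post-scaling check (a sketch, assuming `job_scaled` was built as above):

```r
# Verify the scaled data: flag any NaN columns produced by constant
# predictors, and confirm every column lies in [0, 1]
nan_cols <- names(which(sapply(job_scaled, function(x) any(is.nan(x)))))
nan_cols  # should be character(0); otherwise drop these columns before KNN/ANN
stopifnot(all(sapply(job_scaled, min, na.rm = TRUE) >= 0),
          all(sapply(job_scaled, max, na.rm = TRUE) <= 1))
```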
# Data for Decision Tree Model
str(job_dummy)
## 'data.frame': 17880 obs. of 187 variables:
## $ title : num 16 41 39 33 19 16 21 32 10 39 ...
## $ company_profile : num 885 1286 879 614 1628 ...
## $ description : num 905 2077 355 2600 1520 ...
## $ requirements : num 852 1433 1363 1429 757 ...
## $ benefits : num 0 1292 0 782 21 ...
## $ telecommuting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_logo : num 1 1 1 1 1 0 1 1 1 1 ...
## $ has_questions : num 0 0 0 0 1 0 1 1 1 0 ...
## $ employment_type : num 0 0 1 0 0 1 0 1 0 0 ...
## $ employment_typeContract : num 0 0 0 0 0 0 0 0 0 0 ...
## $ employment_typeFull-time : num 0 1 0 1 1 0 1 0 1 0 ...
## $ employment_typeOther : num 1 0 0 0 0 0 0 0 0 0 ...
## $ employment_typePart-time : num 0 0 0 0 0 0 0 0 0 1 ...
## $ employment_typeTemporary : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceAssociate : num 0 0 0 0 0 0 0 0 1 0 ...
## $ required_experienceDirector : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceEntry level : num 0 0 0 0 0 0 0 0 0 1 ...
## $ required_experienceExecutive : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceInternship : num 1 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceMid-Senior level : num 0 0 0 1 1 0 1 0 0 0 ...
## $ required_experienceNot Applicable : num 0 1 0 0 0 0 0 0 0 0 ...
## $ required_educationAssociate Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationBachelor's Degree : num 0 0 0 1 1 0 0 0 0 0 ...
## $ required_educationCertification : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationDoctorate : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationHigh School or equivalent : num 0 0 0 0 0 0 0 0 0 1 ...
## $ required_educationMaster's Degree : num 0 0 0 0 0 0 1 0 0 0 ...
## $ required_educationProfessional : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationSome College Coursework Completed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationSome High School Coursework : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationUnspecified : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational - Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational - HS Diploma : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fraudulent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_pipe : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_hash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_apply : num 0 1 0 0 0 0 0 0 0 0 ...
## $ benefits_benefits : num 0 0 0 0 1 0 1 0 0 0 ...
## $ slash_present : num 0 0 0 0 0 0 1 0 0 0 ...
## $ backslash_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ amp_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclam_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dash_present : num 0 1 0 1 0 0 0 0 0 1 ...
## $ multiple_spaces : num 0 0 0 0 0 0 0 1 0 0 ...
## $ parens_present : num 0 0 1 0 0 0 1 0 0 0 ...
## $ numbers_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_missing_or_short : num 0 0 0 0 0 1 0 0 0 0 ...
## $ has_heavy_engineering_terms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_certification_terms : num 1 1 1 1 1 0 1 1 1 1 ...
## $ has_years_experience : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_degree_required : num 1 0 1 1 1 0 1 0 0 1 ...
## $ has_tool_software_terms : num 1 1 0 1 0 0 0 0 0 1 ...
## $ has_safety_regulation_terms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_contains_heavy_lists : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_title_mismatch : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_urgent_language : num 0 0 0 0 0 0 0 0 0 1 ...
## $ has_no_experience_needed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_salary_info : num 1 1 1 1 1 1 1 1 0 1 ...
## $ has_qualification_terms : num 0 0 0 0 0 1 0 0 0 0 ...
## $ has_benefits_stated : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_technical_terms : num 0 1 0 1 1 1 0 1 0 0 ...
## $ has_contact_number_or_whatsapp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_language : num 1 1 1 1 1 1 1 1 0 1 ...
## $ has_commission_only_language : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_referral_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_signing_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_perks : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_relocation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ salary_known1 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ department_cleanaccounting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanadministration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanart studio : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanbusiness_management : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanclerical : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancommercial : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancreative : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancustomer service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancustomer_facing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleandepartment : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleandigital : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleaneducation_training : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanengagement : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanengineering : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanfinance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanhr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleaninternational growth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanlegal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanmarketing : num 1 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanmerchandising : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanNA : num 0 0 1 0 1 1 0 1 1 1 ...
## $ department_cleanoperations : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanoperations_logistics : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanother : num 0 1 0 0 0 0 1 0 0 0 ...
## $ department_cleanpermanent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanproduct : num 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
summary(job_dummy)
## title company_profile description requirements
## Min. : 3.00 Min. : 0.0 Min. : 3 Min. : 0.0
## 1st Qu.: 19.00 1st Qu.: 138.0 1st Qu.: 607 1st Qu.: 146.0
## Median : 25.00 Median : 570.0 Median : 1017 Median : 467.0
## Mean : 28.53 Mean : 620.9 Mean : 1218 Mean : 590.1
## 3rd Qu.: 35.00 3rd Qu.: 879.0 3rd Qu.: 1586 3rd Qu.: 820.0
## Max. :142.00 Max. :6178.0 Max. :14907 Max. :10864.0
## benefits telecommuting has_company_logo has_questions
## Min. : 0.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median : 45.0 Median :0.0000 Median :1.0000 Median :0.0000
## Mean : 208.9 Mean :0.0429 Mean :0.7953 Mean :0.4917
## 3rd Qu.: 294.0 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :4429.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## employment_type employment_typeContract employment_typeFull-time
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.1941 Mean :0.08523 Mean :0.6499
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## employment_typeOther employment_typePart-time employment_typeTemporary
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.0127 Mean :0.04457 Mean :0.01348
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## required_experienceAssociate required_experienceDirector
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.1285 Mean :0.02176
## 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
## required_experienceEntry level required_experienceExecutive
## Min. :0.0000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.0000 Median :0.000000
## Mean :0.1508 Mean :0.007886
## 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
## required_experienceInternship required_experienceMid-Senior level
## Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.000
## Mean :0.02131 Mean :0.213
## 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.00000 Max. :1.000
## required_experienceNot Applicable required_educationAssociate Degree
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.06242 Mean :0.01532
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## required_educationBachelor's Degree required_educationCertification
## Min. :0.0000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.0000 Median :0.000000
## Mean :0.2878 Mean :0.009508
## 3rd Qu.:1.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
## required_educationDoctorate required_educationHigh School or equivalent
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.001454 Mean :0.1163
## 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.0000
## required_educationMaster's Degree required_educationProfessional
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02327 Mean :0.004139
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## required_educationSome College Coursework Completed
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.005705
## 3rd Qu.:0.000000
## Max. :1.000000
## required_educationSome High School Coursework required_educationUnspecified
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.00151 Mean :0.07813
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## required_educationVocational required_educationVocational - Degree
## Min. :0.00000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.0000000
## Median :0.00000 Median :0.0000000
## Mean :0.00274 Mean :0.0003356
## 3rd Qu.:0.00000 3rd Qu.:0.0000000
## Max. :1.00000 Max. :1.0000000
## required_educationVocational - HS Diploma fraudulent benefits_pipe
## Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.0005034 Mean :0.04843 Mean :0.001566
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000
## benefits_hash benefits_bonus benefits_apply benefits_benefits
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.05543 Mean :0.07131 Mean :0.04234 Mean :0.2012
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## slash_present backslash_present amp_present exclam_present
## Min. :0.00000 Min. :0.0000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000000 Median :0.00000 Median :0.00000
## Mean :0.09659 Mean :0.0001119 Mean :0.03356 Mean :0.01102
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000000 Max. :1.00000 Max. :1.00000
## dash_present multiple_spaces parens_present numbers_present
## Min. :0.000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.169 Mean :0.009228 Mean :0.08853 Mean :0.04787
## 3rd Qu.:0.000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.000000 Max. :1.00000 Max. :1.00000
## req_missing_or_short has_heavy_engineering_terms has_certification_terms
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.1763 Mean :0.06549 Mean :0.7724
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## has_years_experience has_degree_required has_tool_software_terms
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3444 Mean :0.4238 Mean :0.3497
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## has_safety_regulation_terms req_contains_heavy_lists req_title_mismatch
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.03216 Mean :0.02136 Mean :0.06549
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## has_urgent_language has_no_experience_needed has_salary_info
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:1.0000
## Median :0.00000 Median :0.000000 Median :1.0000
## Mean :0.07959 Mean :0.007159 Mean :0.9705
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.000000 Max. :1.0000
## has_qualification_terms has_benefits_stated has_technical_terms
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1006 Mean :0.09077 Mean :0.2698
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## has_contact_number_or_whatsapp has_company_language
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000
## Mean :0.00179 Mean :0.6383
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
## has_commission_only_language has_referral_bonus has_signing_bonus
## Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.00453 Mean :0.006432 Mean :0.003132
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.000000
## has_perks has_relocation salary_known1
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.00000 Median :0.000000 Median :0.0000
## Mean :0.05872 Mean :0.009955 Mean :0.1604
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.000000 Max. :1.0000
## department_cleanaccounting department_cleanadministration department_cleanall
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.002685 Mean :0.005313 Mean :0.0008948
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
## department_cleanart studio department_cleanbusiness_management
## Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.0000000 Median :0.000000
## Mean :0.0006152 Mean :0.004586
## 3rd Qu.:0.0000000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.000000
## department_cleanclerical department_cleancommercial department_cleancreative
## Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.00151 Mean :0.001007 Mean :0.002685
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.000000
## department_cleancustomer service department_cleancustomer_facing
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.00755 Mean :0.006655
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## department_cleandepartment department_cleandigital
## Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000
## Mean :0.001286 Mean :0.000783
## 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000
## department_cleaneducation_training department_cleanengagement
## Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.0000000
## Mean :0.003244 Mean :0.0007271
## 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.0000000
## department_cleanengineering department_cleanfinance department_cleanhr
## Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.02864 Mean :0.004139 Mean :0.00481
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.00000
## department_cleaninternational growth department_cleanit department_cleanlegal
## Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.0009508 Mean :0.01985 Mean :0.001342
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000
## department_cleanmarketing department_cleanmerchandising department_cleanNA
## Min. :0.00000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000000 Median :1.0000
## Mean :0.02478 Mean :0.0006152 Mean :0.6461
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000000 Max. :1.0000
## department_cleanoperations department_cleanoperations_logistics
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02047 Mean :0.001566
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## department_cleanother department_cleanpermanent department_cleanproduct
## Min. :0.000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.000 Median :0.0000000 Median :0.00000
## Mean :0.126 Mean :0.0007271 Mean :0.01035
## 3rd Qu.:0.000 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000000 Max. :1.00000
## department_cleanproduction department_cleanqa department_cleanr&d
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.001846 Mean :0.001007 Mean :0.003076
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
## department_cleanretail department_cleansales department_cleansquiz
## Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.002573 Mean :0.03322 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000
## department_cleansupport department_cleantech_development
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.001063 Mean :0.02109
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## department_cleantechnology industry_cleanBusiness Administration
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.004418 Mean :0.01359
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## industry_cleanConsulting, Professional Services & Legal
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01493
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanConsumer Goods, Retail & Fashion
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04972
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanDefense, Security & Aerospace industry_cleanEducation & Training
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.003635 Mean :0.05783
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## industry_cleanEnergy, Utilities & Environment
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04066
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanFinance, Banking & Insurance
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0665
## 3rd Qu.:0.0000
## Max. :1.0000
## industry_cleanGovernment, Nonprofit & Public Sector
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01113
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanHealthcare, Wellness & Life Sciences
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04553
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanHospitality, Travel & Leisure
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02601
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanManufacturing & Industrial
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01879
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanMedia, Entertainment & Creative industry_cleanNA
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.08311 Mean :0.2742
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
## industry_cleanReal Estate & Construction industry_cleanTechnology & Software
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.02377 Mean :0.2481
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## industry_cleanTransportation, Logistics & Supply Chain function_cleanArts
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.01432 Mean :0.03378
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## function_cleanEducation function_cleanEngineering & Production
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.01818 Mean :0.1026
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## function_cleanFinance & Accounting function_cleanHealthcare & Science
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.02332 Mean :0.01969
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## function_cleanHuman Resources & Training function_cleanLegal & Compliance
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02164 Mean :0.008837
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## function_cleanManagement & Leadership function_cleanMarketing & Advertising
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.05934 Mean :0.0557
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## function_cleanNA function_cleanOther function_cleanResearch
## Min. :0.000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000 Median :0.00000 Median :0.000000
## Mean :0.361 Mean :0.01818 Mean :0.002796
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000 Max. :1.00000 Max. :1.000000
## function_cleanSales & Customer Service & IT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2487
## 3rd Qu.:0.0000
## Max. :1.0000
## function_cleanSupply Chain & Logistics loc_country_newAT loc_country_newAU
## Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000
## Mean :0.004195 Mean :0.000783 Mean :0.01197
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000
## loc_country_newBE loc_country_newBG loc_country_newBR loc_country_newCA
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.0000000 Median :0.000000 Median :0.00000
## Mean :0.006544 Mean :0.0009508 Mean :0.002013 Mean :0.02556
## 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.0000000 Max. :1.000000 Max. :1.00000
## loc_country_newCH loc_country_newCN loc_country_newCY loc_country_newDE
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.0000000 Median :0.0000000 Median :0.0000000 Median :0.00000
## Mean :0.0008389 Mean :0.0008389 Mean :0.0006152 Mean :0.02142
## 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.0000000 Max. :1.0000000 Max. :1.00000
## loc_country_newDK loc_country_newEE loc_country_newEG loc_country_newES
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.002349 Mean :0.004027 Mean :0.002908 Mean :0.003691
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## loc_country_newFI loc_country_newFR loc_country_newGB loc_country_newGR
## Min. :0.000000 Min. :0.000000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.0000 Median :0.00000
## Mean :0.001622 Mean :0.003915 Mean :0.1333 Mean :0.05257
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.0000 Max. :1.00000
## loc_country_newHK loc_country_newHU loc_country_newID loc_country_newIE
## Min. :0.000000 Min. :0.000000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.0000000 Median :0.000000
## Mean :0.004306 Mean :0.000783 Mean :0.0007271 Mean :0.006376
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000 Max. :1.000000
## loc_country_newIL loc_country_newIN loc_country_newIT loc_country_newJP
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.004027 Mean :0.01544 Mean :0.001734 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.000000
## loc_country_newLT loc_country_newMT loc_country_newMU loc_country_newMX
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.0000000 Median :0.000000 Median :0.000000
## Mean :0.001286 Mean :0.0007271 Mean :0.000783 Mean :0.001007
## 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.0000000 Max. :1.000000 Max. :1.000000
## loc_country_newMY loc_country_newNA loc_country_newNL loc_country_newNZ
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.001174 Mean :0.01935 Mean :0.007103 Mean :0.01862
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## loc_country_newOther loc_country_newPH loc_country_newPK loc_country_newPL
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.009508 Mean :0.007383 Mean :0.00151 Mean :0.004251
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## loc_country_newPT loc_country_newQA loc_country_newRO loc_country_newRU
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.001007 Mean :0.001174 Mean :0.002573 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## loc_country_newSA loc_country_newSE loc_country_newSG loc_country_newTR
## Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.0000000 Median :0.00000 Median :0.000000 Median :0.0000000
## Mean :0.0008389 Mean :0.00274 Mean :0.004474 Mean :0.0009508
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.0000000
## loc_country_newUA loc_country_newUS loc_country_newZA
## Min. :0.0000000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.0000000 Median :1.000 Median :0.000000
## Mean :0.0007271 Mean :0.596 Mean :0.002237
## 3rd Qu.:0.0000000 3rd Qu.:1.000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.000 Max. :1.000000
# Inspect the min-max scaled data used for the KNN and ANN models,
# which are sensitive to feature magnitude (distance- and gradient-based)
str(job_scaled)
## 'data.frame': 17880 obs. of 187 variables:
## $ title : num 0.0935 0.2734 0.259 0.2158 0.1151 ...
## $ company_profile : num 0.1433 0.2082 0.1423 0.0994 0.2635 ...
## $ description : num 0.0605 0.1392 0.0236 0.1742 0.1018 ...
## $ requirements : num 0.0784 0.1319 0.1255 0.1315 0.0697 ...
## $ benefits : num 0 0.29171 0 0.17656 0.00474 ...
## $ telecommuting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_logo : num 1 1 1 1 1 0 1 1 1 1 ...
## $ has_questions : num 0 0 0 0 1 0 1 1 1 0 ...
## $ employment_type : num 0 0 1 0 0 1 0 1 0 0 ...
## $ employment_typeContract : num 0 0 0 0 0 0 0 0 0 0 ...
## $ employment_typeFull.time : num 0 1 0 1 1 0 1 0 1 0 ...
## $ employment_typeOther : num 1 0 0 0 0 0 0 0 0 0 ...
## $ employment_typePart.time : num 0 0 0 0 0 0 0 0 0 1 ...
## $ employment_typeTemporary : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceAssociate : num 0 0 0 0 0 0 0 0 1 0 ...
## $ required_experienceDirector : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceEntry.level : num 0 0 0 0 0 0 0 0 0 1 ...
## $ required_experienceExecutive : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceInternship : num 1 0 0 0 0 0 0 0 0 0 ...
## $ required_experienceMid.Senior.level : num 0 0 0 1 1 0 1 0 0 0 ...
## $ required_experienceNot.Applicable : num 0 1 0 0 0 0 0 0 0 0 ...
## $ required_educationAssociate.Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationBachelor.s.Degree : num 0 0 0 1 1 0 0 0 0 0 ...
## $ required_educationCertification : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationDoctorate : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationHigh.School.or.equivalent : num 0 0 0 0 0 0 0 0 0 1 ...
## $ required_educationMaster.s.Degree : num 0 0 0 0 0 0 1 0 0 0 ...
## $ required_educationProfessional : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationSome.College.Coursework.Completed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationSome.High.School.Coursework : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationUnspecified : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational...Degree : num 0 0 0 0 0 0 0 0 0 0 ...
## $ required_educationVocational...HS.Diploma : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fraudulent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_pipe : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_hash : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ benefits_apply : num 0 1 0 0 0 0 0 0 0 0 ...
## $ benefits_benefits : num 0 0 0 0 1 0 1 0 0 0 ...
## $ slash_present : num 0 0 0 0 0 0 1 0 0 0 ...
## $ backslash_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ amp_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ exclam_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dash_present : num 0 1 0 1 0 0 0 0 0 1 ...
## $ multiple_spaces : num 0 0 0 0 0 0 0 1 0 0 ...
## $ parens_present : num 0 0 1 0 0 0 1 0 0 0 ...
## $ numbers_present : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_missing_or_short : num 0 0 0 0 0 1 0 0 0 0 ...
## $ has_heavy_engineering_terms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_certification_terms : num 1 1 1 1 1 0 1 1 1 1 ...
## $ has_years_experience : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_degree_required : num 1 0 1 1 1 0 1 0 0 1 ...
## $ has_tool_software_terms : num 1 1 0 1 0 0 0 0 0 1 ...
## $ has_safety_regulation_terms : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_contains_heavy_lists : num 0 0 0 0 0 0 0 0 0 0 ...
## $ req_title_mismatch : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_urgent_language : num 0 0 0 0 0 0 0 0 0 1 ...
## $ has_no_experience_needed : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_salary_info : num 1 1 1 1 1 1 1 1 0 1 ...
## $ has_qualification_terms : num 0 0 0 0 0 1 0 0 0 0 ...
## $ has_benefits_stated : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_technical_terms : num 0 1 0 1 1 1 0 1 0 0 ...
## $ has_contact_number_or_whatsapp : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_company_language : num 1 1 1 1 1 1 1 1 0 1 ...
## $ has_commission_only_language : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_referral_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_signing_bonus : num 0 0 0 0 0 0 0 0 0 0 ...
## $ has_perks : num 0 0 0 1 0 0 0 0 0 0 ...
## $ has_relocation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ salary_known1 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ department_cleanaccounting : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanadministration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanall : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanart.studio : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanbusiness_management : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanclerical : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancommercial : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancreative : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancustomer.service : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleancustomer_facing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleandepartment : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleandigital : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleaneducation_training : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanengagement : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanengineering : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanfinance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanhr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleaninternational.growth : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanlegal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanmarketing : num 1 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanmerchandising : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanNA : num 0 0 1 0 1 1 0 1 1 1 ...
## $ department_cleanoperations : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanoperations_logistics : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanother : num 0 1 0 0 0 0 1 0 0 0 ...
## $ department_cleanpermanent : num 0 0 0 0 0 0 0 0 0 0 ...
## $ department_cleanproduct : num 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
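Every feature in `job_scaled` falls in the [0, 1] range. A minimal sketch of the min-max normalization that produces this kind of object (the helper name `normalize` and applying it over an input data frame `job` are illustrative assumptions, not necessarily the exact code used earlier in this report):

``` r
# Min-max normalization: rescales a numeric vector to the [0, 1] range
normalize <- function(x) {
  rng <- max(x) - min(x)
  if (rng == 0) return(rep(0, length(x)))  # guard against constant columns
  (x - min(x)) / rng
}

# Apply column-wise to every predictor, keeping the result as a data frame
job_scaled <- as.data.frame(lapply(job, normalize))
```

Because the dummy-coded indicator columns are already 0/1, normalization leaves them unchanged; only the text-length and count features (e.g. `title`, `description`) are actually rescaled.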
# Confirm every feature now lies within the [0, 1] range after scaling
summary(job_scaled)
## title company_profile description requirements
## Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.1151 1st Qu.:0.02234 1st Qu.:0.04053 1st Qu.:0.01344
## Median :0.1583 Median :0.09226 Median :0.06804 Median :0.04299
## Mean :0.1837 Mean :0.10050 Mean :0.08152 Mean :0.05432
## 3rd Qu.:0.2302 3rd Qu.:0.14228 3rd Qu.:0.10621 3rd Qu.:0.07548
## Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
## benefits telecommuting has_company_logo has_questions
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.01016 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.04717 Mean :0.0429 Mean :0.7953 Mean :0.4917
## 3rd Qu.:0.06638 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## employment_type employment_typeContract employment_typeFull.time
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.1941 Mean :0.08523 Mean :0.6499
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## employment_typeOther employment_typePart.time employment_typeTemporary
## Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.00000
## Mean :0.0127 Mean :0.04457 Mean :0.01348
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.00000
## required_experienceAssociate required_experienceDirector
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.1285 Mean :0.02176
## 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000
## required_experienceEntry.level required_experienceExecutive
## Min. :0.0000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.0000 Median :0.000000
## Mean :0.1508 Mean :0.007886
## 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
## required_experienceInternship required_experienceMid.Senior.level
## Min. :0.00000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.000
## Median :0.00000 Median :0.000
## Mean :0.02131 Mean :0.213
## 3rd Qu.:0.00000 3rd Qu.:0.000
## Max. :1.00000 Max. :1.000
## required_experienceNot.Applicable required_educationAssociate.Degree
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.06242 Mean :0.01532
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## required_educationBachelor.s.Degree required_educationCertification
## Min. :0.0000 Min. :0.000000
## 1st Qu.:0.0000 1st Qu.:0.000000
## Median :0.0000 Median :0.000000
## Mean :0.2878 Mean :0.009508
## 3rd Qu.:1.0000 3rd Qu.:0.000000
## Max. :1.0000 Max. :1.000000
## required_educationDoctorate required_educationHigh.School.or.equivalent
## Min. :0.000000 Min. :0.0000
## 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.000000 Median :0.0000
## Mean :0.001454 Mean :0.1163
## 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.000000 Max. :1.0000
## required_educationMaster.s.Degree required_educationProfessional
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02327 Mean :0.004139
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## required_educationSome.College.Coursework.Completed
## Min. :0.000000
## 1st Qu.:0.000000
## Median :0.000000
## Mean :0.005705
## 3rd Qu.:0.000000
## Max. :1.000000
## required_educationSome.High.School.Coursework required_educationUnspecified
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.00151 Mean :0.07813
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## required_educationVocational required_educationVocational...Degree
## Min. :0.00000 Min. :0.0000000
## 1st Qu.:0.00000 1st Qu.:0.0000000
## Median :0.00000 Median :0.0000000
## Mean :0.00274 Mean :0.0003356
## 3rd Qu.:0.00000 3rd Qu.:0.0000000
## Max. :1.00000 Max. :1.0000000
## required_educationVocational...HS.Diploma fraudulent benefits_pipe
## Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.0005034 Mean :0.04843 Mean :0.001566
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000
## benefits_hash benefits_bonus benefits_apply benefits_benefits
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.00000 Median :0.00000 Median :0.0000
## Mean :0.05543 Mean :0.07131 Mean :0.04234 Mean :0.2012
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.0000
## slash_present backslash_present amp_present exclam_present
## Min. :0.00000 Min. :0.0000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.0000000 Median :0.00000 Median :0.00000
## Mean :0.09659 Mean :0.0001119 Mean :0.03356 Mean :0.01102
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.0000000 Max. :1.00000 Max. :1.00000
## dash_present multiple_spaces parens_present numbers_present
## Min. :0.000 Min. :0.000000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.000 Median :0.000000 Median :0.00000 Median :0.00000
## Mean :0.169 Mean :0.009228 Mean :0.08853 Mean :0.04787
## 3rd Qu.:0.000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.000000 Max. :1.00000 Max. :1.00000
## req_missing_or_short has_heavy_engineering_terms has_certification_terms
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.0000
## Median :0.0000 Median :0.00000 Median :1.0000
## Mean :0.1763 Mean :0.06549 Mean :0.7724
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## has_years_experience has_degree_required has_tool_software_terms
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3444 Mean :0.4238 Mean :0.3497
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## has_safety_regulation_terms req_contains_heavy_lists req_title_mismatch
## Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.00000
## Mean :0.03216 Mean :0.02136 Mean :0.06549
## 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000 Max. :1.00000
## has_urgent_language has_no_experience_needed has_salary_info
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:1.0000
## Median :0.00000 Median :0.000000 Median :1.0000
## Mean :0.07959 Mean :0.007159 Mean :0.9705
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.000000 Max. :1.0000
## has_qualification_terms has_benefits_stated has_technical_terms
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000
## Mean :0.1006 Mean :0.09077 Mean :0.2698
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000
## has_contact_number_or_whatsapp has_company_language
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000
## Mean :0.00179 Mean :0.6383
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
## has_commission_only_language has_referral_bonus has_signing_bonus
## Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.00453 Mean :0.006432 Mean :0.003132
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.000000
## has_perks has_relocation salary_known1
## Min. :0.00000 Min. :0.000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000
## Median :0.00000 Median :0.000000 Median :0.0000
## Mean :0.05872 Mean :0.009955 Mean :0.1604
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.000000 Max. :1.0000
## department_cleanaccounting department_cleanadministration department_cleanall
## Min. :0.000000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.000000 Median :0.0000000
## Mean :0.002685 Mean :0.005313 Mean :0.0008948
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000
## department_cleanart.studio department_cleanbusiness_management
## Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.0000000 Median :0.000000
## Mean :0.0006152 Mean :0.004586
## 3rd Qu.:0.0000000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.000000
## department_cleanclerical department_cleancommercial department_cleancreative
## Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.00151 Mean :0.001007 Mean :0.002685
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000 Max. :1.000000
## department_cleancustomer.service department_cleancustomer_facing
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.00755 Mean :0.006655
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## department_cleandepartment department_cleandigital
## Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000
## Mean :0.001286 Mean :0.000783
## 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000
## department_cleaneducation_training department_cleanengagement
## Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.000000 Median :0.0000000
## Mean :0.003244 Mean :0.0007271
## 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.000000 Max. :1.0000000
## department_cleanengineering department_cleanfinance department_cleanhr
## Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.02864 Mean :0.004139 Mean :0.00481
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.000000 Max. :1.00000
## department_cleaninternational.growth department_cleanit department_cleanlegal
## Min. :0.0000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.0009508 Mean :0.01985 Mean :0.001342
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000
## department_cleanmarketing department_cleanmerchandising department_cleanNA
## Min. :0.00000 Min. :0.0000000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000000 Median :1.0000
## Mean :0.02478 Mean :0.0006152 Mean :0.6461
## 3rd Qu.:0.00000 3rd Qu.:0.0000000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000000 Max. :1.0000
## department_cleanoperations department_cleanoperations_logistics
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02047 Mean :0.001566
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## department_cleanother department_cleanpermanent department_cleanproduct
## Min. :0.000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.000 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.000 Median :0.0000000 Median :0.00000
## Mean :0.126 Mean :0.0007271 Mean :0.01035
## 3rd Qu.:0.000 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.000 Max. :1.0000000 Max. :1.00000
## department_cleanproduction department_cleanqa department_cleanr.d
## Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.001846 Mean :0.001007 Mean :0.003076
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000
## department_cleanretail department_cleansales department_cleansquiz
## Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.002573 Mean :0.03322 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000
## department_cleansupport department_cleantech_development
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.001063 Mean :0.02109
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## department_cleantechnology industry_cleanBusiness.Administration
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.004418 Mean :0.01359
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## industry_cleanConsulting..Professional.Services...Legal
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01493
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanConsumer.Goods..Retail...Fashion
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04972
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanDefense..Security...Aerospace industry_cleanEducation...Training
## Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000
## Mean :0.003635 Mean :0.05783
## 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000
## industry_cleanEnergy..Utilities...Environment
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04066
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanFinance..Banking...Insurance
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.0665
## 3rd Qu.:0.0000
## Max. :1.0000
## industry_cleanGovernment..Nonprofit...Public.Sector
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01113
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanHealthcare..Wellness...Life.Sciences
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.04553
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanHospitality..Travel...Leisure
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.02601
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanManufacturing...Industrial
## Min. :0.00000
## 1st Qu.:0.00000
## Median :0.00000
## Mean :0.01879
## 3rd Qu.:0.00000
## Max. :1.00000
## industry_cleanMedia..Entertainment...Creative industry_cleanNA
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.08311 Mean :0.2742
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
## industry_cleanReal.Estate...Construction industry_cleanTechnology...Software
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.02377 Mean :0.2481
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## industry_cleanTransportation..Logistics...Supply.Chain function_cleanArts
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.01432 Mean :0.03378
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## function_cleanEducation function_cleanEngineering...Production
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.01818 Mean :0.1026
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## function_cleanFinance...Accounting function_cleanHealthcare...Science
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.02332 Mean :0.01969
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## function_cleanHuman.Resources...Training function_cleanLegal...Compliance
## Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.000000
## Mean :0.02164 Mean :0.008837
## 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.000000
## function_cleanManagement...Leadership function_cleanMarketing...Advertising
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :0.0000
## Mean :0.05934 Mean :0.0557
## 3rd Qu.:0.00000 3rd Qu.:0.0000
## Max. :1.00000 Max. :1.0000
## function_cleanNA function_cleanOther function_cleanResearch
## Min. :0.000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000 Median :0.00000 Median :0.000000
## Mean :0.361 Mean :0.01818 Mean :0.002796
## 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000 Max. :1.00000 Max. :1.000000
## function_cleanSales...Customer.Service...IT
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2487
## 3rd Qu.:0.0000
## Max. :1.0000
## function_cleanSupply.Chain...Logistics loc_country_newAT loc_country_newAU
## Min. :0.000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.00000
## Mean :0.004195 Mean :0.000783 Mean :0.01197
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.00000
## loc_country_newBE loc_country_newBG loc_country_newBR loc_country_newCA
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.0000000 Median :0.000000 Median :0.00000
## Mean :0.006544 Mean :0.0009508 Mean :0.002013 Mean :0.02556
## 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.0000000 Max. :1.000000 Max. :1.00000
## loc_country_newCH loc_country_newCN loc_country_newCY loc_country_newDE
## Min. :0.0000000 Min. :0.0000000 Min. :0.0000000 Min. :0.00000
## 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.0000000 1st Qu.:0.00000
## Median :0.0000000 Median :0.0000000 Median :0.0000000 Median :0.00000
## Mean :0.0008389 Mean :0.0008389 Mean :0.0006152 Mean :0.02142
## 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.0000000 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.0000000 Max. :1.0000000 Max. :1.00000
## loc_country_newDK loc_country_newEE loc_country_newEG loc_country_newES
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.002349 Mean :0.004027 Mean :0.002908 Mean :0.003691
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## loc_country_newFI loc_country_newFR loc_country_newGB loc_country_newGR
## Min. :0.000000 Min. :0.000000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.000000 Median :0.000000 Median :0.0000 Median :0.00000
## Mean :0.001622 Mean :0.003915 Mean :0.1333 Mean :0.05257
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.000000 Max. :1.0000 Max. :1.00000
## loc_country_newHK loc_country_newHU loc_country_newID loc_country_newIE
## Min. :0.000000 Min. :0.000000 Min. :0.0000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.0000000 Median :0.000000
## Mean :0.004306 Mean :0.000783 Mean :0.0007271 Mean :0.006376
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.0000000 Max. :1.000000
## loc_country_newIL loc_country_newIN loc_country_newIT loc_country_newJP
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.000000
## Mean :0.004027 Mean :0.01544 Mean :0.001734 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.000000
## loc_country_newLT loc_country_newMT loc_country_newMU loc_country_newMX
## Min. :0.000000 Min. :0.0000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.0000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.0000000 Median :0.000000 Median :0.000000
## Mean :0.001286 Mean :0.0007271 Mean :0.000783 Mean :0.001007
## 3rd Qu.:0.000000 3rd Qu.:0.0000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.0000000 Max. :1.000000 Max. :1.000000
## loc_country_newMY loc_country_newNA loc_country_newNL loc_country_newNZ
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.00000
## Median :0.000000 Median :0.00000 Median :0.000000 Median :0.00000
## Mean :0.001174 Mean :0.01935 Mean :0.007103 Mean :0.01862
## 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## loc_country_newOther loc_country_newPH loc_country_newPK loc_country_newPL
## Min. :0.000000 Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.00000 Median :0.000000
## Mean :0.009508 Mean :0.007383 Mean :0.00151 Mean :0.004251
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.00000 Max. :1.000000
## loc_country_newPT loc_country_newQA loc_country_newRO loc_country_newRU
## Min. :0.000000 Min. :0.000000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000 1st Qu.:0.000000
## Median :0.000000 Median :0.000000 Median :0.000000 Median :0.000000
## Mean :0.001007 Mean :0.001174 Mean :0.002573 Mean :0.001119
## 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## loc_country_newSA loc_country_newSE loc_country_newSG loc_country_newTR
## Min. :0.0000000 Min. :0.00000 Min. :0.000000 Min. :0.0000000
## 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0000000
## Median :0.0000000 Median :0.00000 Median :0.000000 Median :0.0000000
## Mean :0.0008389 Mean :0.00274 Mean :0.004474 Mean :0.0009508
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0000000
## Max. :1.0000000 Max. :1.00000 Max. :1.000000 Max. :1.0000000
## loc_country_newUA loc_country_newUS loc_country_newZA
## Min. :0.0000000 Min. :0.000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0.000 1st Qu.:0.000000
## Median :0.0000000 Median :1.000 Median :0.000000
## Mean :0.0007271 Mean :0.596 Mean :0.002237
## 3rd Qu.:0.0000000 3rd Qu.:1.000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :1.000 Max. :1.000000
After cleaning the data, we can split it. We will do a 70-30 split. It is important to split the data so that we can use one portion to train the models and the remaining portion to test the efficacy of the models we create. While we typically do a 50-50 split for the first split when building stacked/two-level models, we are doing a 70-30 split here because we have almost 18,000 rows of data. Even with a 70-30 split, there will be enough data to effectively train and test the decision tree model that will combine the six models being created at this point.
# Let's do a 70-30 split.
trainprop <- 0.7 # This is the proportion of data we want in our training data set
set.seed(12345) # Let's make the randomization "not so random"
train_rows <- sample(1:nrow(job), trainprop*nrow(job)) # Get the rows for the training data. We can use train_rows for job, job_dummy, and job_scaled, as all three data sets have the same number of rows/observations.
# Train and test data for Logistic Regression, SVM, and Random Forest Models
job_train <- job[train_rows, ] # Store the training data
job_test <- job[-train_rows, ] # Store the testing data
# Train and test data for Decision Tree Model
job_dummy_train <- job_dummy[train_rows, ] # Store the training data
job_dummy_test <- job_dummy[-train_rows, ] # Store the testing data
# Train and test data for KNN and ANN models
job_scaled_train <- job_scaled[train_rows, ] # Store the training data
job_scaled_test <- job_scaled[-train_rows, ] # Store the testing data
# Let's do a quick check that the random split worked (using the dependent variable)
summary(job_train$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_test$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
summary(job_dummy_train$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_dummy_test$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
summary(job_scaled_train$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04714 0.00000 1.00000
summary(job_scaled_test$fraudulent)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05145 0.00000 1.00000
# The mean value is similar between the train and test data sets signifying the split was done successfully
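Because fraudulent is a 0/1 indicator, a proportion table gives a more direct read on class balance than summary(). This is a small optional check (not part of the original analysis) using the same split objects created above.

```r
# Proportion of legitimate (0) vs. fraudulent (1) posts in each split
prop.table(table(job_train$fraudulent))
prop.table(table(job_test$fraudulent))
```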
Now that the data has been split, it is time to create the various models. We will load any necessary libraries, then build the models on the training data (job_train, job_dummy_train, and job_scaled_train). Once the models are trained and evaluated, we can use them in the future to predict whether job postings are fraudulent. We will also improve/optimize these models, using different levers depending on the model.
For the logistic regression model, we can improve the model by adding combinations of predictors and changing the lr_pred_cutoff value. We want to extract the predicted probabilities (lr_pred) for the stacked model, so that we have the most “raw” version of the predictions/results.
# Build Model
# Since we are trying to predict fraudulent, we will have that be our response variable. Since we are using all other columns to predict fraudulent, those will be our predictor variables.
# Let's add some other combinations of predictors to increase the model's accuracy and sensitivity
# lr_model <- glm(fraudulent ~ . + industry_clean * function_clean
#+ description * benefits
#+ description * requirements
#+ required_experience * required_education
#+ employment_type * required_experience
#+ employment_type * required_education
#+ required_experience * has_years_experience
#, data = job_train, family = "binomial")
# saveRDS(lr_model, "lrJobModel.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
lr_model <- readRDS("lrJobModel.RDS")
# Predictor - Accuracy | Sensitivity
# Base - 0.9627 | 0.42029
# industry_clean * function_clean - 0.9623 | 0.44928
# description * benefits - 0.9625 | 0.45290
# description * requirements - 0.962 | 0.45652
# required_experience * required_education - 0.9631 | 0.49275
# employment_type * required_experience - 0.9618 | 0.49638
# employment_type * required_education - 0.9532 | 0.52899
# required_experience * has_years_experience - 0.9629 | 0.52899
# The following combinations of predictors decreased the model's sensitivity and/or accuracy
#+ required_experience * has_no_experience_needed
#+ benefits * has_benefits_stated
#+ benefits * has_referral_bonus
#+ benefits * has_signing_bonus
# 2nd Logistic Regression Model - using step function to optimize through all combinations of predictors (.*.)
# A 2nd model was attempted; however, due to the number of variables/columns, it took too long to run
# m1 <- glm(fraudulent ~ . + .*., data = job_train, family = "binomial")
# saveRDS(m1, "LRJobModel_m1.RDS")
# lr_model_2 <- step(m1, direction = "backward")
# saveRDS(lr_model_2, "LRJobModel_2.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
# lr_model_2 <- readRDS("LRJobModel_2.RDS")
# Predict
# standard model
lr_pred <- predict(lr_model, job_test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
lr_pred_cutoff <- 0.5
lr_bin_pred <- ifelse(lr_pred >= lr_pred_cutoff, 1, 0)
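Rather than guessing at lr_pred_cutoff, the accuracy/sensitivity trade-off can be tabulated directly. The helper below is a sketch (not part of the original analysis): it thresholds a vector of predicted probabilities at several cutoffs and reports accuracy and sensitivity for each, so it can also be reused for the SVM, KNN, and ANN cutoffs chosen later.

```r
# Sweep candidate cutoffs; report accuracy and sensitivity at each one
cutoff_sweep <- function(probs, actual, cutoffs = c(0.5, 0.4, 0.3, 0.2, 0.1)) {
  do.call(rbind, lapply(cutoffs, function(ct) {
    pred <- ifelse(probs >= ct, 1, 0)
    data.frame(cutoff      = ct,
               accuracy    = mean(pred == actual),
               sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1))
  }))
}
# Example usage: cutoff_sweep(lr_pred, job_test$fraudulent)
```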
For the decision tree model, we can improve the model by utilizing a cost_matrix to change the weighting/ratio between false positives and false negatives that the model is trying to balance/optimize. We will also keep an unweighted decision tree model to use for the stacked model, as we want the results (dt_pred) without having placed our thumb on the scale.
library(C50) # We need this library to run a decision tree model
# Build Model (without weights)
# dt_model <- C5.0(as.factor(fraudulent) ~ ., data = job_dummy_train)
# saveRDS(dt_model, "dtJobModel.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
dt_model <- readRDS("dtJobModel.RDS")
plot(dt_model)
# Predict (without weights)
dt_pred <- predict(dt_model, job_dummy_test)
# Build Model (with weights)
cost_matrix <- matrix(c(0, 1, 6, 0), nrow = 2)
cost_matrix # Check the matrix looks correct
## [,1] [,2]
## [1,] 0 6
## [2,] 1 0
# dt_cost_model <- C5.0(as.factor(fraudulent) ~ ., data = job_dummy_train, costs = cost_matrix)
# saveRDS(dt_cost_model, "dtJobModel_2.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
dt_cost_model <- readRDS("dtJobModel_2.RDS")
plot(dt_cost_model)
# Predict (with weights)
dt_weights_pred <- predict(dt_cost_model, job_dummy_test)
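One caution with C5.0's costs argument is that the orientation of the matrix matters. Following the convention used by the C50 package (rows correspond to the predicted class, columns to the actual class), attaching dimnames makes it explicit that the 6 is the cost of a false negative. This is an optional, equivalent way to build the same matrix as above.

```r
# Same cost matrix as above, with labeled dimensions
matrix_dims <- list(c("0", "1"), c("0", "1"))
names(matrix_dims) <- c("predicted", "actual")
cost_matrix <- matrix(c(0, 1, 6, 0), nrow = 2, dimnames = matrix_dims)
cost_matrix["0", "1"] # cost of predicting 0 (legitimate) when the post is actually fraudulent
```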
For the SVM model, we can improve the model by changing the kernel; finding the optimal kernel involves a guess-and-check method. We can also improve the model by changing the SVM_pred_cutoff value. We want to extract the numeric predictions (SVM_pred) for the stacked model, so that we have the most “raw” version of the predictions/results.
library(kernlab) # We need this library to run an SVM model
# Build Model
# SVM_model <- ksvm(fraudulent ~ ., data = job_train, kernel = "rbfdot")
# SVM_model_2 <- ksvm(fraudulent ~ ., data = job_train, kernel = "tanhdot")
# saveRDS(SVM_model, "SVMJobModel.RDS")
# saveRDS(SVM_model_2, "SVMJobModel_2.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
SVM_model <- readRDS("SVMJobModel.RDS")
SVM_model_2 <- readRDS("SVMJobModel_2.RDS")
# Kernel - Sensitivity | Accuracy
# rbfdot - 0.23551 | 0.9605
# polydot - 0.184783 | 0.9556
# vanilladot - 0.184783 | 0.9556
# tanhdot - 0.55072 | 0.5021
# laplacedot - 0.159420 | 0.9567
# besseldot - 0.29348 | 0.4746
# anovadot - 1.00000 | 0.0515
# splinedot - NA | NA <- no results after letting run for 30 minutes
# From testing all kernels, rbfdot had the highest accuracy and the 4th-highest sensitivity, while tanhdot produced the highest sensitivity (at significantly lower accuracy). As a result, we decided to run both SVM models to see whether higher accuracy or higher sensitivity leads to a better final result.
# Predict
SVM_pred <- predict(SVM_model, job_test)
SVM_pred_2 <- predict(SVM_model_2, job_test)
SVM_pred_cutoff <- 0.1
# We reduced the cutoff value to 0.1 to reduce the number of false negatives and increase the sensitivity of the model (at the cost of some specificity).
SVM_pred_2_cutoff <- 0.3
# We reduced the cutoff value to 0.3 for the same reason.
SVM_bin_prob <- ifelse(SVM_pred >= SVM_pred_cutoff, 1, 0)
SVM_bin_prob_2 <- ifelse(SVM_pred_2 >= SVM_pred_2_cutoff, 1, 0)
For the random forest model, we can improve the model by modifying
the ntree and nodesize values.
library(randomForest) # We need this library to run a random forest model
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
# Build Model
# rf_model <- randomForest(as.factor(fraudulent) ~ ., data = job_train, ntree = 2000, nodesize = 5)
# saveRDS(rf_model, "rfJobModel.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
rf_model <- readRDS("rfJobModel.RDS")
varImpPlot(rf_model) # From this plot, we can see that function, company_profile, and description are the biggest predictors of being fraudulent
# Predict
rf_pred <- predict(rf_model, job_test)
For the KNN model, we can improve the model by modifying the
k value and changing the KNN_pred_cutoff
value. We want to extract the probabilities (KNN_prob) for
the stacked model, so that we have the most “raw” version of the
predictions/results.
library(class)
# Identify predictor columns (all except the target)
predictor_cols <- colnames(job_scaled_train)[colnames(job_scaled_train) != "fraudulent"]
# Train and test predictor matrices
train_X <- job_scaled_train[, predictor_cols]
test_X <- job_scaled_test[, predictor_cols]
# Target vector
train_y <- job_scaled_train$fraudulent
# Run KNN
# KNN_pred <- knn(train = train_X, test = test_X, cl = train_y, k = 4, prob = TRUE) # We optimized k over the range [2, 100] for accuracy and sensitivity
# saveRDS(KNN_pred, "KNNJobModel.RDS")
# k = # | Accuracy | Sensitivity
# k = 100 | 0.9485 | 0.000000
# k = 75 | 0.9485 | 0.000000
# k = 50 | 0.9508 | 0.043478
# k = 40 | 0.9534 | 0.105072
# k = 30 | 0.9603 | 0.25725
# k = 20 | 0.9681 | 0.43478
# k = 10 | 0.9724 | 0.57971
# k = 5 | 0.9754 | 0.65580
# k = 4 | 0.9735 | 0.71377 # highest sensitivity
# k = 3 | 0.9767 | 0.70652 # highest accuracy
# k = 2 | 0.9674 | 0.81522
# We chose k = 4 since k = 4 and k = 3 have similar results; the increase in sensitivity going from k = 3 to k = 4 is larger than the decrease in accuracy.
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
KNN_pred <- readRDS("KNNJobModel.RDS")
# Convert KNN "prob" attribute to numeric probabilities
KNN_prob <- ifelse(KNN_pred == "1",
attr(KNN_pred, "prob"),
1 - attr(KNN_pred, "prob"))
# Apply cutoff
KNN_pred_cutoff <- 0.3
# We reduced the cutoff value to 0.3 to reduce the number of false negatives and increase the sensitivity of the model.
KNN_bin_prob <- ifelse(KNN_prob >= KNN_pred_cutoff, 1, 0)
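The k search summarized in the comments above can be expressed as a small loop. This sketch (not part of the original analysis) refits KNN for each candidate k and records test-set accuracy and sensitivity; on ~12,500 training rows this is slow, so it is shown for reference rather than re-run at knit time. It assumes the train_X/test_X/train_y objects defined earlier plus the test labels.

```r
library(class)

# Refit KNN for each candidate k; record test-set accuracy and sensitivity
knn_sweep <- function(train_X, test_X, train_y, test_y, ks = c(2, 3, 4, 5, 10, 20)) {
  do.call(rbind, lapply(ks, function(k) {
    pred <- knn(train = train_X, test = test_X, cl = train_y, k = k)
    data.frame(k = k,
               accuracy    = mean(pred == test_y),
               sensitivity = sum(pred == "1" & test_y == 1) / sum(test_y == 1))
  }))
}
# Example usage: knn_sweep(train_X, test_X, train_y, job_scaled_test$fraudulent)
```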
For the ANN model, we can improve the model by changing the number of
nodes (e.g., hidden = c(5, 3, 2)) and changing the
ANN_pred_cutoff value. We want to extract the fractional
values (ANN_pred) for the stacked model, so that we have
the most “raw” version of the predictions/results.
library(neuralnet) # We need this library to run an ANN model
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
# Build Model
set.seed(12345) # Let's make the randomization "not so random"
#ANN_model_1 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8)
#ANN_model_2 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(3, 2))
#ANN_model_3 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(5, 3, 2))
#ANN_model_4 <- neuralnet(fraudulent ~ ., data = job_scaled_train, lifesign = "full", stepmax = 1e8, hidden = c(5, 3, 3, 2))
#saveRDS(ANN_model_1, "ANNJobModel_1.RDS")
#saveRDS(ANN_model_2, "ANNJobModel_2.RDS")
#saveRDS(ANN_model_3, "ANNJobModel_3.RDS")
#saveRDS(ANN_model_4, "ANNJobModel_4.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
ANN_model_1 <- readRDS("ANNJobModel_1.RDS")
ANN_model_2 <- readRDS("ANNJobModel_2.RDS")
ANN_model_3 <- readRDS("ANNJobModel_3.RDS")
ANN_model_4 <- readRDS("ANNJobModel_4.RDS")
plot(ANN_model_1, rep = "best")
plot(ANN_model_2, rep = "best")
plot(ANN_model_3, rep = "best")
plot(ANN_model_4, rep = "best")
# Predict
ANN_pred <- predict(ANN_model_1, job_scaled_test)
ANN_pred_cutoff <- 0.3 # We reduced the cutoff value to 0.3 to reduce the number of false negatives and increase the sensitivity of the model.
ANN_bin_pred <- ifelse(ANN_pred >= ANN_pred_cutoff, 1, 0)
ANN_pred_2 <- predict(ANN_model_2, job_scaled_test)
ANN_pred_cutoff_2 <- 0.3 # We reduced the cutoff value to 0.3 to increase the sensitivity of the model.
ANN_bin_pred_2 <- ifelse(ANN_pred_2 >= ANN_pred_cutoff_2, 1, 0)
ANN_pred_3 <- predict(ANN_model_3, job_scaled_test)
ANN_pred_cutoff_3 <- 0.2 # We reduced the cutoff value to 0.2 to reduce the number of false negatives and increase the sensitivity of the model.
ANN_bin_pred_3 <- ifelse(ANN_pred_3 >= ANN_pred_cutoff_3, 1, 0)
ANN_pred_4 <- predict(ANN_model_4, job_scaled_test)
ANN_pred_cutoff_4 <- 0.2 # We reduced the cutoff value to 0.2 to reduce the number of false negatives and increase the sensitivity of the model.
ANN_bin_pred_4 <- ifelse(ANN_pred_4 >= ANN_pred_cutoff_4, 1, 0)
We will now take the prediction results of the various models and create a new data set on which to build the stacked model. Similar to the normal workflow, we will split the data, then use it to build the 2nd-level decision tree model. Finally, we will use the resulting decision tree model to predict on the test data set. We will also use a cost matrix to optimize the model.
# Combine the predictions of the 7 individual models into a new data frame
stacked_data <- data.frame(
lr_pred = c(lr_pred),
dt_pred = c(dt_pred),
SVM_pred = c(SVM_pred),
SVM_pred_2 = c(SVM_pred_2),
rf_pred = c(rf_pred),
KNN_pred = c(KNN_prob),
ANN_pred = c(ANN_pred_3),
actual = c(job_test$fraudulent)
)
# Split the data in to train and test data
# Let's do a 50-50 split this time, since the stacked data set is much smaller (only the original test rows). We want there to be a decent amount of test data
trainprop <- 0.5 # This is the proportion of data we want in our training data set
set.seed(12345) # Let's make the randomization "not so random"
stacked_train_rows <- sample(1:nrow(stacked_data), trainprop*nrow(stacked_data)) # Get the rows for the training data
# Train and test data for the stacked model
stacked_train <- stacked_data[stacked_train_rows, ] # Store the training data
stacked_test <- stacked_data[-stacked_train_rows, ] # Store the testing data
# Let's do a quick check that the random split worked (using the dependent variable)
summary(stacked_train$actual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05295 0.00000 1.00000
summary(stacked_test$actual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04996 0.00000 1.00000
# The mean value is similar between the train and test data sets signifying the split was done successfully
# Build and predict a decision tree model as a 2nd-level model stacked on top of the other individual models
# Build Model (without weights)
# stacked_unweighted_model <- C5.0(as.factor(actual) ~ ., data = stacked_train)
# saveRDS(stacked_unweighted_model, "stackedJobModel.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
stacked_unweighted_model <- readRDS("stackedJobModel.RDS")
plot(stacked_unweighted_model)
# Build Model (with weights)
stacked_cost_matrix <- matrix(c(0, 1, 5, 0), nrow = 2)
stacked_cost_matrix # Check the matrix looks correct
## [,1] [,2]
## [1,] 0 5
## [2,] 1 0
# stacked_model <- C5.0(as.factor(actual) ~ ., data = stacked_train, costs = stacked_cost_matrix)
# saveRDS(stacked_model, "stackedJobModel_2.RDS")
# Since I have saved the model above, I can comment out the code above so that the model does not re-run each time I knit my file
stacked_model <- readRDS("stackedJobModel_2.RDS")
plot(stacked_model)
# Predict (without weights)
stacked_unweighted_pred <- predict(stacked_unweighted_model, stacked_test)
# Predict (with weights)
stacked_pred <- predict(stacked_model, stacked_test)
Now that the models are created, we can evaluate them by creating confusion matrices. It will be important to look at the accuracy and sensitivity of each model. We also want to make sure that we are minimizing false negatives, as those are much more costly than false positives. A false negative is a fraudulent post that the model labels as legitimate, while a false positive is a legitimate post that the model flags as fraudulent. False negatives are worse because a missed fraudulent post stays live, wasting applicants’ time and exposing them to phishing, whereas a falsely flagged legitimate post can simply be reviewed and restored.
# Let's build some confusion matrices
library(caret) # We need this library to build a confusion matrix
library(knitr) # Load in library so that the table is formatted in an easy to read manner
cm_lr <- confusionMatrix(as.factor(lr_bin_pred), as.factor(job_test$fraudulent), positive = "1")
cm_lr
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5019 130
## 1 69 146
##
## Accuracy : 0.9629
## 95% CI : (0.9575, 0.9678)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 3.644e-07
##
## Kappa : 0.5756
##
## Mcnemar's Test P-Value : 2.107e-05
##
## Sensitivity : 0.52899
## Specificity : 0.98644
## Pos Pred Value : 0.67907
## Neg Pred Value : 0.97475
## Prevalence : 0.05145
## Detection Rate : 0.02722
## Detection Prevalence : 0.04008
## Balanced Accuracy : 0.75771
##
## 'Positive' Class : 1
##
# Decision Tree Model (without weights)
cm_unweighted_dt <- confusionMatrix(as.factor(dt_pred), as.factor(job_test$fraudulent), positive = "1")
# Looking at the confusion matrix, we need to apply a cost matrix. In this situation, false negatives are extremely costly, so we want a cost matrix that weights them appropriately. We will cost false negatives at a 6:1 ratio to false positives to reduce the number of false negatives. This will increase the number of false positives, but we are less concerned about that, as it is less costly to deal with job posts that are falsely flagged than with fraudulent posts that are missed.
cm_unweighted_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5038 96
## 1 50 180
##
## Accuracy : 0.9728
## 95% CI : (0.9681, 0.977)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6973
##
## Mcnemar's Test P-Value : 0.0001959
##
## Sensitivity : 0.65217
## Specificity : 0.99017
## Pos Pred Value : 0.78261
## Neg Pred Value : 0.98130
## Prevalence : 0.05145
## Detection Rate : 0.03356
## Detection Prevalence : 0.04288
## Balanced Accuracy : 0.82117
##
## 'Positive' Class : 1
##
# Decision Tree Model (with weights)
cm_dt <- confusionMatrix(as.factor(dt_weights_pred), as.factor(job_test$fraudulent), positive = "1")
cm_dt
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4937 47
## 1 151 229
##
## Accuracy : 0.9631
## 95% CI : (0.9577, 0.968)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 2.557e-07
##
## Kappa : 0.679
##
## Mcnemar's Test P-Value : 2.482e-13
##
## Sensitivity : 0.82971
## Specificity : 0.97032
## Pos Pred Value : 0.60263
## Neg Pred Value : 0.99057
## Prevalence : 0.05145
## Detection Rate : 0.04269
## Detection Prevalence : 0.07084
## Balanced Accuracy : 0.90002
##
## 'Positive' Class : 1
##
cm_SVM <- confusionMatrix(as.factor(SVM_bin_prob), as.factor(job_test$fraudulent), positive = "1")
cm_SVM
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5052 126
## 1 36 150
##
## Accuracy : 0.9698
## 95% CI : (0.9649, 0.9742)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 1.968e-14
##
## Kappa : 0.6342
##
## Mcnemar's Test P-Value : 2.700e-12
##
## Sensitivity : 0.54348
## Specificity : 0.99292
## Pos Pred Value : 0.80645
## Neg Pred Value : 0.97567
## Prevalence : 0.05145
## Detection Rate : 0.02796
## Detection Prevalence : 0.03468
## Balanced Accuracy : 0.76820
##
## 'Positive' Class : 1
##
cm_SVM_2 <- confusionMatrix(as.factor(SVM_bin_prob_2), as.factor(job_test$fraudulent), positive = "1")
cm_SVM_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2537 124
## 1 2551 152
##
## Accuracy : 0.5013
## 95% CI : (0.4878, 0.5148)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0096
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.55072
## Specificity : 0.49862
## Pos Pred Value : 0.05623
## Neg Pred Value : 0.95340
## Prevalence : 0.05145
## Detection Rate : 0.02834
## Detection Prevalence : 0.50391
## Balanced Accuracy : 0.52467
##
## 'Positive' Class : 1
##
cm_rf <- confusionMatrix(as.factor(rf_pred), as.factor(job_test$fraudulent), positive = "1")
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5085 93
## 1 3 183
##
## Accuracy : 0.9821
## 95% CI : (0.9782, 0.9855)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7832
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.66304
## Specificity : 0.99941
## Pos Pred Value : 0.98387
## Neg Pred Value : 0.98204
## Prevalence : 0.05145
## Detection Rate : 0.03412
## Detection Prevalence : 0.03468
## Balanced Accuracy : 0.83123
##
## 'Positive' Class : 1
##
cm_KNN <- confusionMatrix(as.factor(KNN_bin_prob), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_KNN
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5016 75
## 1 72 201
##
## Accuracy : 0.9726
## 95% CI : (0.9679, 0.9768)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7178
##
## Mcnemar's Test P-Value : 0.869
##
## Sensitivity : 0.72826
## Specificity : 0.98585
## Pos Pred Value : 0.73626
## Neg Pred Value : 0.98527
## Prevalence : 0.05145
## Detection Rate : 0.03747
## Detection Prevalence : 0.05089
## Balanced Accuracy : 0.85705
##
## 'Positive' Class : 1
##
cm_ANN <- confusionMatrix(as.factor(ANN_bin_pred), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5015 88
## 1 73 188
##
## Accuracy : 0.97
## 95% CI : (0.9651, 0.9744)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 1.12e-14
##
## Kappa : 0.6844
##
## Mcnemar's Test P-Value : 0.2699
##
## Sensitivity : 0.68116
## Specificity : 0.98565
## Pos Pred Value : 0.72031
## Neg Pred Value : 0.98276
## Prevalence : 0.05145
## Detection Rate : 0.03505
## Detection Prevalence : 0.04866
## Balanced Accuracy : 0.83341
##
## 'Positive' Class : 1
##
cm_ANN_2 <- confusionMatrix(as.factor(ANN_bin_pred_2), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_2
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4970 90
## 1 118 186
##
## Accuracy : 0.9612
## 95% CI : (0.9557, 0.9662)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 7.054e-06
##
## Kappa : 0.6209
##
## Mcnemar's Test P-Value : 0.06119
##
## Sensitivity : 0.67391
## Specificity : 0.97681
## Pos Pred Value : 0.61184
## Neg Pred Value : 0.98221
## Prevalence : 0.05145
## Detection Rate : 0.03468
## Detection Prevalence : 0.05667
## Balanced Accuracy : 0.82536
##
## 'Positive' Class : 1
##
cm_ANN_3 <- confusionMatrix(as.factor(ANN_bin_pred_3), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_3
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4976 79
## 1 112 197
##
## Accuracy : 0.9644
## 95% CI : (0.9591, 0.9692)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 1.855e-08
##
## Kappa : 0.6547
##
## Mcnemar's Test P-Value : 0.02059
##
## Sensitivity : 0.71377
## Specificity : 0.97799
## Pos Pred Value : 0.63754
## Neg Pred Value : 0.98437
## Prevalence : 0.05145
## Detection Rate : 0.03673
## Detection Prevalence : 0.05761
## Balanced Accuracy : 0.84588
##
## 'Positive' Class : 1
##
cm_ANN_4 <- confusionMatrix(as.factor(ANN_bin_pred_4), as.factor(job_scaled_test$fraudulent), positive = "1")
cm_ANN_4
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4955 82
## 1 133 194
##
## Accuracy : 0.9599
## 95% CI : (0.9543, 0.965)
## No Information Rate : 0.9485
## P-Value [Acc > NIR] : 5.39e-05
##
## Kappa : 0.6224
##
## Mcnemar's Test P-Value : 0.0006497
##
## Sensitivity : 0.70290
## Specificity : 0.97386
## Pos Pred Value : 0.59327
## Neg Pred Value : 0.98372
## Prevalence : 0.05145
## Detection Rate : 0.03617
## Detection Prevalence : 0.06096
## Balanced Accuracy : 0.83838
##
## 'Positive' Class : 1
##
ANN_Comparison <- data.frame(
"Model" = c("ANN_1", "ANN_2", "ANN_3", "ANN_4"),
"Accuracy" = c(round(cm_ANN$overall["Accuracy"], 4), round(cm_ANN_2$overall["Accuracy"], 4), round(cm_ANN_3$overall["Accuracy"], 4), round(cm_ANN_4$overall["Accuracy"], 4)),
"Sensitivity" = c(round(cm_ANN$byClass["Sensitivity"], 4), round(cm_ANN_2$byClass["Sensitivity"], 4), round(cm_ANN_3$byClass["Sensitivity"], 4), round(cm_ANN_4$byClass["Sensitivity"], 4)),
"Kappa" = c(round(cm_ANN$overall["Kappa"], 4), round(cm_ANN_2$overall["Kappa"], 4), round(cm_ANN_3$overall["Kappa"], 4), round(cm_ANN_4$overall["Kappa"])),
"P-Value" = c(round(cm_ANN$overall["AccuracyPValue"], 4), round(cm_ANN_2$overall["AccuracyPValue"], 4), round(cm_ANN_3$overall["AccuracyPValue"], 4), round(cm_ANN_4$overall["AccuracyPValue"], 4))
)
kable(ANN_Comparison, format = "markdown")
| Model | Accuracy | Sensitivity | Kappa | P.Value |
|---|---|---|---|---|
| ANN_1 | 0.9700 | 0.6812 | 0.6844 | 0e+00 |
| ANN_2 | 0.9612 | 0.6739 | 0.6209 | 0e+00 |
| ANN_3 | 0.9644 | 0.7138 | 0.6547 | 0e+00 |
| ANN_4 | 0.9599 | 0.7029 | 0.6224 | 1e-04 |
Looking at these four models, I will use ANN_3, as it has the highest sensitivity (0.7138). This higher sensitivity comes at the cost of slightly lower accuracy. I will use this model for the implementation step and also feed it into the stacked model.
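The selection above can also be done programmatically. A minimal sketch, rebuilding a stand-in for the `ANN_Comparison` data frame from the sensitivities reported in the table above:

```r
# Stand-in for ANN_Comparison, rebuilt from the reported values above
ann_cmp <- data.frame(
  Model       = c("ANN_1", "ANN_2", "ANN_3", "ANN_4"),
  Sensitivity = c(0.6812, 0.6739, 0.7138, 0.7029)
)

# Pick the candidate with the highest sensitivity (fewest missed frauds)
best_ann <- ann_cmp$Model[which.max(ann_cmp$Sensitivity)]
best_ann  # "ANN_3"
```

This makes the selection criterion explicit and reproducible if more ANN variants are added later.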
# Raw data values
cm_unweight_stacked <- confusionMatrix(as.factor(stacked_unweighted_pred), as.factor(stacked_test$actual), positive = "1")
# Looking at the confusion matrix, we need to apply a cost matrix. In this
# situation, false negatives are extremely costly, so we weight them
# accordingly: the cost matrix penalizes false negatives at a 5:1 ratio to
# false positives. This reduces false negatives at the expense of more false
# positives, an acceptable trade because reviewing a falsely flagged post is
# far cheaper than missing a fraudulent one.
cm_unweight_stacked
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2534 30
## 1 14 104
##
## Accuracy : 0.9836
## 95% CI : (0.978, 0.9881)
## No Information Rate : 0.95
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.8168
##
## Mcnemar's Test P-Value : 0.02374
##
## Sensitivity : 0.77612
## Specificity : 0.99451
## Pos Pred Value : 0.88136
## Neg Pred Value : 0.98830
## Prevalence : 0.04996
## Detection Rate : 0.03878
## Detection Prevalence : 0.04400
## Balanced Accuracy : 0.88531
##
## 'Positive' Class : 1
##
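One way the 5:1 cost ratio could be applied, assuming the second-level decision tree is fit with `rpart` (the actual training call is not shown in this chunk, so this is a sketch on synthetic stand-in data): `rpart`'s `parms$loss` matrix is indexed `loss[observed, predicted]`, so the entry at observed = 1, predicted = 0 penalizes false negatives.

```r
library(rpart)

# Loss matrix: rows = observed class (0, 1), columns = predicted class (0, 1).
# A false negative (observed 1, predicted 0) costs 5; a false positive
# (observed 0, predicted 1) costs 1.
loss_5to1 <- matrix(c(0, 5,
                      1, 0), nrow = 2)

# Hypothetical training call on synthetic stand-in data; the real model
# would instead be refit on the stacked training set
set.seed(1)
demo <- data.frame(x = rnorm(200), y = factor(rbinom(200, 1, 0.2)))
weighted_fit <- rpart(y ~ x, data = demo, method = "class",
                      parms = list(loss = loss_5to1))
```

With this loss matrix, the tree's splits and class assignments shift toward predicting the positive class, which is exactly the behavior seen in the weighted stacked results below.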
cm_stacked <- confusionMatrix(as.factor(stacked_pred), as.factor(stacked_test$actual), positive = "1")
cm_stacked
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 2496 22
## 1 52 112
##
## Accuracy : 0.9724
## 95% CI : (0.9655, 0.9783)
## No Information Rate : 0.95
## P-Value [Acc > NIR] : 5.264e-09
##
## Kappa : 0.7372
##
## Mcnemar's Test P-Value : 0.0007485
##
## Sensitivity : 0.83582
## Specificity : 0.97959
## Pos Pred Value : 0.68293
## Neg Pred Value : 0.99126
## Prevalence : 0.04996
## Detection Rate : 0.04176
## Detection Prevalence : 0.06115
## Balanced Accuracy : 0.90771
##
## 'Positive' Class : 1
##
Model_Comparison <- data.frame(
"Model" = c("Logistic Regression", "Decision Tree (weights)", "SVM", "SVM_2", "Random Forest", "KNN", "ANN", "Stacked Model"),
"Accuracy" = c(round(cm_lr$overall["Accuracy"], 4),
round(cm_dt$overall["Accuracy"], 4),
round(cm_SVM$overall["Accuracy"], 4),
round(cm_SVM_2$overall["Accuracy"], 4),
round(cm_rf$overall["Accuracy"], 4),
round(cm_KNN$overall["Accuracy"], 4),
round(cm_ANN$overall["Accuracy"], 4),
round(cm_stacked$overall["Accuracy"], 4)),
"Sensitivity" = c(round(cm_lr$byClass["Sensitivity"], 4),
round(cm_dt$byClass["Sensitivity"], 4),
round(cm_SVM$byClass["Sensitivity"], 4),
round(cm_SVM_2$byClass["Sensitivity"], 4),
round(cm_rf$byClass["Sensitivity"], 4),
round(cm_KNN$byClass["Sensitivity"], 4),
round(cm_ANN$byClass["Sensitivity"], 4),
round(cm_stacked$byClass["Sensitivity"], 4)),
"Kappa" = c(round(cm_lr$overall["Kappa"], 4),
round(cm_dt$overall["Kappa"], 4),
round(cm_SVM$overall["Kappa"], 4),
round(cm_SVM_2$overall["Kappa"], 4),
round(cm_rf$overall["Kappa"], 4),
round(cm_KNN$overall["Kappa"], 4),
round(cm_ANN$overall["Kappa"], 4),
round(cm_stacked$overall["Kappa"], 4)),
"P-Value" = c(round(cm_lr$overall["AccuracyPValue"], 4),
round(cm_dt$overall["AccuracyPValue"], 4),
round(cm_SVM$overall["AccuracyPValue"], 4),
round(cm_SVM_2$overall["AccuracyPValue"], 4),
round(cm_rf$overall["AccuracyPValue"], 4),
round(cm_KNN$overall["AccuracyPValue"], 4),
round(cm_ANN$overall["AccuracyPValue"], 4),
round(cm_stacked$overall["AccuracyPValue"], 4))
)
kable(Model_Comparison, format = "markdown")
| Model | Accuracy | Sensitivity | Kappa | P.Value |
|---|---|---|---|---|
| Logistic Regression | 0.9629 | 0.5290 | 0.5756 | 0 |
| Decision Tree (weights) | 0.9631 | 0.8297 | 0.6790 | 0 |
| SVM | 0.9698 | 0.5435 | 0.6342 | 0 |
| SVM_2 | 0.5013 | 0.5507 | 0.0096 | 1 |
| Random Forest | 0.9821 | 0.6630 | 0.7832 | 0 |
| KNN | 0.9726 | 0.7283 | 0.7178 | 0 |
| ANN | 0.9700 | 0.6812 | 0.6844 | 0 |
| Stacked Model | 0.9724 | 0.8358 | 0.7372 | 0 |
Stacked Model
Comparing the stacked model to the individual models, the stacked model has the third-highest accuracy (0.9724) and the highest sensitivity (0.8358). Its kappa is second only to the Random Forest model. Finally, while it does not have the smallest p-value, its p-value rounds to 0 at four decimal places, indicating the model is statistically significant.
Now that the models are created and evaluated, it is time to implement them and see the financial impacts. It is also important to calculate the financial impact of having no model. I make the following assumptions for the financial data I am missing: each false positive costs $35 to handle, each false negative (missed fraudulent post) costs $500, and identified fraud earns a benefit of fraud_identification_rate * $1,000,000. Note: results are scaled up to 100,000 posts so they are comparable across test sets.
# Assumptions
fp_cost <- 35        # cost of handling one falsely flagged post
fn_cost <- 500       # cost of one missed fraudulent post
num_posts <- 100000  # scale all results to 100,000 posts
nm_scalar <- num_posts / nrow(job)          # scalar for the full data set
m_scalar <- num_posts / nrow(job_test)      # scalar for the test set
sm_scalar <- num_posts / nrow(stacked_test) # scalar for the stacked test set
bonus <- 1000000     # benefit of identifying 100% of fraud
# Baseline (no model): every fraudulent post is missed
nm_frad <- sum(job$fraudulent) * nm_scalar
nm_total_cost <- nm_frad * fn_cost
With no model, there is no way to flag fraudulent posts ahead of time, so every post is effectively treated as legitimate. We therefore miss all 4,843 (scaled) fraudulent posts, costing $2,421,700.
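The per-model cost blocks below all repeat the same arithmetic, so they could be collapsed into a single helper. A sketch (the name `model_cost` is mine, not from the analysis; it assumes a caret `confusionMatrix` object whose `$table` is indexed `[prediction, reference]`):

```r
# Hypothetical helper wrapping the repeated cost arithmetic below
model_cost <- function(cm, scalar, n_fraud,
                       fp_cost = 35, fn_cost = 500, bonus = 1e6) {
  fp      <- fp_cost * cm$table["1", "0"] * scalar    # cost of false alarms
  fn      <- fn_cost * cm$table["0", "1"] * scalar    # cost of missed fraud
  rate    <- (cm$table["1", "1"] * scalar) / n_fraud  # fraud identification rate
  benefit <- bonus * rate
  list(fp_cost = fp, fn_cost = fn, suc_rate = rate,
       benefit = benefit, total = fp + fn - benefit)
}

# e.g. model_cost(cm_rf, m_scalar, nm_frad)
```

Beyond removing duplication, this makes the cost assumptions explicit function arguments, so they are easy to vary later.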
lr_fp_cost = fp_cost * cm_lr$table["1", "0"] * m_scalar
lr_fn_cost = fn_cost * cm_lr$table["0", "1"] * m_scalar
lr_suc_rate = (cm_lr$table["1", "1"] * m_scalar) / nm_frad
lr_bonus = bonus * lr_suc_rate
lr_total_cost = lr_fp_cost + lr_fn_cost - lr_bonus
With the logistic regression model, we end up with 1,286 false positives, costing $45,022.37, and 2,424 false negatives, costing $1,211,782. The model has a fraud identification rate of 0.56, yielding a benefit of $561,970.80. The total cost the job-posting company incurs is therefore $694,833.90.
dt_fp_cost = fp_cost * cm_dt$table["1", "0"] * m_scalar
dt_fn_cost = fn_cost * cm_dt$table["0", "1"] * m_scalar
dt_suc_rate = (cm_dt$table["1", "1"] * m_scalar) / nm_frad
dt_bonus = bonus * dt_suc_rate
dt_total_cost = dt_fp_cost + dt_fn_cost - dt_bonus
With the decision tree model, we end up with 2,815 false positives, costing $98,527.22, and 876 false negatives, costing $438,105.90. The model has a fraud identification rate of 0.88, yielding a benefit of $881,447.30. The total cost is −$344,814.20; the negative sign means the model produces a net benefit for the company.
SVM_fp_cost = fp_cost * cm_SVM$table["1", "0"] * m_scalar
SVM_fn_cost = fn_cost * cm_SVM$table["0", "1"] * m_scalar
SVM_suc_rate = (cm_SVM$table["1", "1"] * m_scalar) / nm_frad
SVM_bonus = bonus * SVM_suc_rate
SVM_total_cost = SVM_fp_cost + SVM_fn_cost - SVM_bonus
With the SVM model, we end up with 671 false positives, costing $23,489.93, and 2,349 false negatives, costing $1,174,497. The model has a fraud identification rate of 0.58, yielding a benefit of $577,367.20. The total cost the company incurs is $620,619.40.
SVM_fp_cost_2 = fp_cost * cm_SVM_2$table["1", "0"] * m_scalar
SVM_fn_cost_2 = fn_cost * cm_SVM_2$table["0", "1"] * m_scalar
SVM_suc_rate_2 = (cm_SVM_2$table["1", "1"] * m_scalar) / nm_frad
SVM_bonus_2 = bonus * SVM_suc_rate_2
SVM_total_cost_2 = SVM_fp_cost_2 + SVM_fn_cost_2 - SVM_bonus_2
With the second SVM model, we end up with 47,558 false positives, costing $1,664,523, and 2,312 false negatives, costing $1,155,854. The model has a fraud identification rate of 0.59, yielding a benefit of $585,065.40. The total cost the company incurs is $2,235,311, nearly as much as having no model at all.
rf_fp_cost = fp_cost * cm_rf$table["1", "0"] * m_scalar
rf_fn_cost = fn_cost * cm_rf$table["0", "1"] * m_scalar
rf_suc_rate = (cm_rf$table["1", "1"] * m_scalar) / nm_frad
rf_bonus = bonus * rf_suc_rate
rf_total_cost = rf_fp_cost + rf_fn_cost - rf_bonus
With the random forest model, we end up with 56 false positives, costing $1,957.49, and 1,734 false negatives, costing $866,890.40. The model has a fraud identification rate of 0.70, yielding a benefit of $704,388. The total cost the company incurs is $164,459.90.
KNN_fp_cost = fp_cost * cm_KNN$table["1", "0"] * m_scalar
KNN_fn_cost = fn_cost * cm_KNN$table["0", "1"] * m_scalar
KNN_suc_rate = (cm_KNN$table["1", "1"] * m_scalar) / nm_frad
KNN_bonus = bonus * KNN_suc_rate
KNN_total_cost = KNN_fp_cost + KNN_fn_cost - KNN_bonus
With the KNN model, we end up with 1,342 false positives, costing $46,979.87, and 1,398 false negatives, costing $699,105.20. The model has a fraud identification rate of 0.77, yielding a benefit of $773,672.10. The total cost is −$27,587.04, a small net benefit.
ANN_fp_cost = fp_cost * cm_ANN_3$table["1", "0"] * m_scalar
ANN_fn_cost = fn_cost * cm_ANN_3$table["0", "1"] * m_scalar
ANN_suc_rate = (cm_ANN$table["1", "1"] * m_scalar) / nm_frad
ANN_bonus = bonus * ANN_suc_rate
ANN_total_cost = ANN_fp_cost + ANN_fn_cost - ANN_bonus
With the ANN model (ANN_3), we end up with 2,088 false positives, costing $73,079.79, and 1,473 false negatives, costing $736,390.80. Using ANN_3's true positives consistently, the fraud identification rate is 0.76, yielding a benefit of roughly $758,276. The total cost the company incurs is roughly $51,194.
stacked_fp_cost = fp_cost * cm_stacked$table["1", "0"] * sm_scalar
stacked_fn_cost = fn_cost * cm_stacked$table["0", "1"] * sm_scalar
stacked_suc_rate = (cm_stacked$table["1", "1"] * sm_scalar) / nm_frad
stacked_bonus = bonus * stacked_suc_rate
stacked_total_cost = stacked_fp_cost + stacked_fn_cost - stacked_bonus
With the stacked model, we end up with 1,939 false positives, costing $67,859.81, and 820 false negatives, costing $410,141.70. The model has a fraud identification rate of 0.86, yielding a benefit of $862,201.70. The total cost is −$384,200.20, the largest net benefit of any model.
results <- data.frame(
"Model" = c("No Model", "Logistic Regression", "Decision Tree", "SVM", "SVM_2", "Random Forest", "KNN", "ANN", "Stacked Model"),
"Accuracy" = c(0, round(cm_lr$overall["Accuracy"], 4),
round(cm_dt$overall["Accuracy"], 4),
round(cm_SVM$overall["Accuracy"], 4),
round(cm_SVM_2$overall["Accuracy"], 4),
round(cm_rf$overall["Accuracy"], 4),
round(cm_KNN$overall["Accuracy"], 4),
round(cm_ANN$overall["Accuracy"], 4),
round(cm_stacked$overall["Accuracy"], 4)),
"Sensitivity" = c(0, round(cm_lr$byClass["Sensitivity"], 4),
round(cm_dt$byClass["Sensitivity"], 4),
round(cm_SVM$byClass["Sensitivity"], 4),
round(cm_SVM_2$byClass["Sensitivity"], 4),
round(cm_rf$byClass["Sensitivity"], 4),
round(cm_KNN$byClass["Sensitivity"], 4),
round(cm_ANN$byClass["Sensitivity"], 4), round(cm_stacked$byClass["Sensitivity"], 4))
)
results$FP_Cost <- c(0, lr_fp_cost, dt_fp_cost, SVM_fp_cost, SVM_fp_cost_2, rf_fp_cost, KNN_fp_cost, ANN_fp_cost, stacked_fp_cost)
results$FN_Cost <- c(nm_total_cost, lr_fn_cost, dt_fn_cost, SVM_fn_cost, SVM_fn_cost_2, rf_fn_cost, KNN_fn_cost, ANN_fn_cost, stacked_fn_cost)
results$Benefit <- c(0, lr_bonus, dt_bonus, SVM_bonus, SVM_bonus_2, rf_bonus, KNN_bonus, ANN_bonus, stacked_bonus)
results$Total_Cost <- c(nm_total_cost, lr_total_cost, dt_total_cost, SVM_total_cost, SVM_total_cost_2, rf_total_cost, KNN_total_cost, ANN_total_cost, stacked_total_cost)
results$Cost_Savings <- (nm_total_cost - results$Total_Cost)
results$FP_Cost <- format(round(results$FP_Cost, 2), big.mark = ",")
results$FN_Cost <- format(round(results$FN_Cost, 2), big.mark = ",")
results$Benefit <- format(round(results$Benefit, 2), big.mark = ",")
results$Total_Cost <- format(round(results$Total_Cost, 2), big.mark = ",")
results$Cost_Savings <- format(round(results$Cost_Savings, 2), big.mark = ",")
kable(results, format = "markdown", digits = 4)
| Model | Accuracy | Sensitivity | FP_Cost | FN_Cost | Benefit | Total_Cost | Cost_Savings |
|---|---|---|---|---|---|---|---|
| No Model | 0.0000 | 0.0000 | 0.00 | 2,421,700.2 | 0.0 | 2,421,700.22 | 0.0 |
| Logistic Regression | 0.9629 | 0.5290 | 45,022.37 | 1,211,782.2 | 561,970.8 | 694,833.88 | 1,726,866.4 |
| Decision Tree | 0.9631 | 0.8297 | 98,527.22 | 438,105.9 | 881,447.3 | -344,814.16 | 2,766,514.4 |
| SVM | 0.9698 | 0.5435 | 23,489.93 | 1,174,496.6 | 577,367.2 | 620,619.37 | 1,801,080.9 |
| SVM_2 | 0.5013 | 0.5507 | 1,664,522.74 | 1,155,853.8 | 585,065.4 | 2,235,311.15 | 186,389.1 |
| Random Forest | 0.9821 | 0.6630 | 1,957.49 | 866,890.4 | 704,388.0 | 164,459.88 | 2,257,240.3 |
| KNN | 0.9726 | 0.7283 | 46,979.87 | 699,105.2 | 773,672.1 | -27,587.04 | 2,449,287.3 |
| ANN | 0.9700 | 0.6812 | 73,079.79 | 736,390.8 | 758,276.5 | 51,194.0 | 2,370,506.2 |
| Stacked Model | 0.9724 | 0.8358 | 67,859.81 | 410,141.7 | 862,201.7 | -384,200.20 | 2,805,900.4 |
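Since the $35 and $500 unit costs are assumptions, it is worth checking how sensitive the winning model's result is to them. A rough sketch using the stacked model's rounded scaled counts reported above (the function `stacked_total` is mine, not from the analysis):

```r
# Rough sensitivity check on the assumed $500 false-negative cost, using the
# stacked model's scaled counts from above (1,939 FP; 820 FN; benefit
# $862,201.70; $35 per FP)
stacked_total <- function(fn_cost, fp = 1939, fn = 820,
                          benefit = 862201.7, fp_cost = 35) {
  fp * fp_cost + fn * fn_cost - benefit
}

# Total cost at several assumed false-negative costs
sapply(c(250, 500, 1000), stacked_total)
```

Under these rounded counts, the stacked model remains a net benefit even if the false-negative cost is halved, and only turns slightly positive (about $25,700) if that cost doubles to $1,000, so the conclusion is fairly robust to this assumption.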